For online information and ordering of these and other Manning books, please visit www.manning.com. The publisher offers discounts on these books when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com
©2021 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
♾ Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Marina Michaels
Technical development editor: Christopher Haupt
Review editor: Aleksandar Dragosavljević
Production editor: Deirdre S. Hiam
Copy editor: Frances Buran
Proofreader: Jason Everett
Technical proofreader: Tuan A. Tran
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor
ISBN: 9781617296468
To my wife, Peggy, who has supported not only my journey in high performance computing, but also that of our son Jon and daughter Rachel. Scientific programming is far from her medical expertise, but she has accompanied me and made it our journey. To my son, Jon, and daughter, Rachel, who have rekindled the flame and for your promising future.
To my husband Rick, who supported me the entire way, thank you for taking the early shifts and letting me work into the night. You never let me give up on myself. To my parents and in-laws, thank you for all your help and support. And to my son, Derek, for being one of my biggest inspirations; you are the reason I leap instead of jump.
Part 1 Introduction to parallel computing
1.1 Why should you learn about parallel computing?
1.1.1 What are the potential benefits of parallel computing?
1.1.2 Parallel computing cautions
1.2 The fundamental laws of parallel computing
1.2.1 The limit to parallel computing: Amdahl’s Law
1.2.2 Breaking through the parallel limit: Gustafson-Barsis’s Law
1.3 How does parallel computing work?
1.3.1 Walking through a sample application
1.3.2 A hardware model for today’s heterogeneous parallel systems
1.3.3 The application/software model for today’s heterogeneous parallel systems
1.4 Categorizing parallel approaches
1.6 Parallel speedup versus comparative speedups: Two different measures
1.7 What will you learn in this book?
2 Planning for parallelization
2.1 Approaching a new project: The preparation
2.1.1 Version control: Creating a safety vault for your parallel code
2.1.2 Test suites: The first step to creating a robust, reliable application
2.1.3 Finding and fixing memory issues
2.1.4 Improving code portability
2.2 Profiling: Probing the gap between system capabilities and application performance
2.3 Planning: A foundation for success
2.3.1 Exploring with benchmarks and mini-apps
2.3.2 Design of the core data structures and code modularity
2.3.3 Algorithms: Redesign for parallel
2.4 Implementation: Where it all happens
2.5 Commit: Wrapping it up with quality
3 Performance limits and profiling
3.1 Know your application’s potential performance limits
3.2 Determine your hardware capabilities: Benchmarking
3.2.1 Tools for gathering system characteristics
3.2.2 Calculating theoretical maximum flops
3.2.3 The memory hierarchy and theoretical memory bandwidth
3.2.4 Empirical measurement of bandwidth and flops
3.2.5 Calculating the machine balance between flops and bandwidth
3.3 Characterizing your application: Profiling
3.3.2 Empirical measurement of processor clock frequency and energy consumption
3.3.3 Tracking memory during run time
4 Data design and performance models
4.1 Performance data structures: Data-oriented design
4.1.2 Array of Structures (AoS) versus Structures of Arrays (SoA)
4.1.3 Array of Structures of Arrays (AoSoA)
4.2 Three Cs of cache misses: Compulsory, capacity, conflict
4.3 Simple performance models: A case study
4.3.1 Full matrix data representations
4.3.2 Compressed sparse storage representations
4.4 Advanced performance models
5 Parallel algorithms and patterns
5.1 Algorithm analysis for parallel computing applications
5.2 Performance models versus algorithmic complexity
5.3 Parallel algorithms: What are they?
5.5 Spatial hashing: A highly-parallel algorithm
5.5.1 Using perfect hashing for spatial mesh operations
5.5.2 Using compact hashing for spatial mesh operations
5.6 Prefix sum (scan) pattern and its importance in parallel computing
5.6.1 Step-efficient parallel scan operation
5.6.2 Work-efficient parallel scan operation
5.6.3 Parallel scan operations for large arrays
5.7 Parallel global sum: Addressing the problem of associativity
5.8 Future of parallel algorithm research
Part 2 CPU: The parallel workhorse
6 Vectorization: FLOPs for free
6.1 Vectorization and single instruction, multiple data (SIMD) overview
6.2 Hardware trends for vectorization
6.3.1 Optimized libraries provide performance for little effort
6.3.2 Auto-vectorization: The easy way to vectorization speedup (most of the time)
6.3.3 Teaching the compiler through hints: Pragmas and directives
6.3.4 Crappy loops, we got them: Use vector intrinsics
6.3.5 Not for the faint of heart: Using assembler code for vectorization
6.4 Programming style for better vectorization
6.5 Compiler flags relevant for vectorization for various compilers
6.6 OpenMP SIMD directives for better portability
7.2 Typical OpenMP use cases: Loop-level, high-level, and MPI plus OpenMP
7.2.1 Loop-level OpenMP for quick parallelization
7.2.2 High-level OpenMP for better parallel performance
7.2.3 MPI plus OpenMP for extreme scalability
7.3 Examples of standard loop-level OpenMP
7.3.1 Loop level OpenMP: Vector addition example
7.3.3 Loop level OpenMP: Stencil example
7.3.4 Performance of loop-level examples
7.3.5 Reduction example of a global sum using OpenMP threading
7.3.6 Potential loop-level OpenMP issues
7.4 Variable scope importance for correctness in OpenMP
7.5 Function-level OpenMP: Making a whole function thread parallel
7.6 Improving parallel scalability with high-level OpenMP
7.6.1 How to implement high-level OpenMP
7.6.2 Example of implementing high-level OpenMP
7.7 Hybrid threading and vectorization with OpenMP
7.8 Advanced examples using OpenMP
7.8.1 Stencil example with a separate pass for the x and y directions
7.8.2 Kahan summation implementation with OpenMP threading
7.8.3 Threaded implementation of the prefix scan algorithm
7.9 Threading tools essential for robust implementations
7.9.1 Using Allinea/ARM MAP to get a quick high-level profile of your application
7.9.2 Finding your thread race conditions with Intel® Inspector
7.10 Example of a task-based support algorithm
8.1 The basics for an MPI program
8.1.1 Basic MPI function calls for every MPI program
8.1.2 Compiler wrappers for simpler MPI programs
8.1.3 Using parallel startup commands
8.1.4 Minimum working example of an MPI program
8.2 The send and receive commands for process-to-process communication
8.3 Collective communication: A powerful component of MPI
8.3.1 Using a barrier to synchronize timers
8.3.2 Using the broadcast to handle small file input
8.3.3 Using a reduction to get a single value from across all processes
8.3.4 Using gather to put order in debug printouts
8.3.5 Using scatter and gather to send data out to processes for work
8.4.1 Stream triad to measure bandwidth on the node
8.4.2 Ghost cell exchanges in a two-dimensional (2D) mesh
8.4.3 Ghost cell exchanges in a three-dimensional (3D) stencil calculation
8.5 Advanced MPI functionality to simplify code and enable optimizations
8.5.1 Using custom MPI data types for performance and code simplification
8.5.2 Cartesian topology support in MPI
8.5.3 Performance tests of ghost cell exchange variants
8.6 Hybrid MPI plus OpenMP for extreme scalability
8.6.1 The benefits of hybrid MPI plus OpenMP
Part 3 GPUs: Built to accelerate
9 GPU architectures and concepts
9.1 The CPU-GPU system as an accelerated computational platform
9.1.1 Integrated GPUs: An underused option on commodity-based systems
9.1.2 Dedicated GPUs: The workhorse option
9.2 The GPU and the thread engine
9.2.1 The compute unit is the streaming multiprocessor (or subslice)
9.2.2 Processing elements are the individual processors
9.2.3 Multiple data operations by each processing element
9.2.4 Calculating the peak theoretical flops for some leading GPUs
9.3 Characteristics of GPU memory spaces
9.3.1 Calculating theoretical peak memory bandwidth
9.3.2 Measuring the GPU stream benchmark
9.3.3 Roofline performance model for GPUs
9.3.4 Using the mixbench performance tool to choose the best GPU for a workload
9.4 The PCI bus: CPU to GPU data transfer overhead
9.4.1 Theoretical bandwidth of the PCI bus
9.4.2 A benchmark application for PCI bandwidth
9.5 Multi-GPU platforms and MPI
9.5.1 Optimizing the data movement between GPUs across the network
9.5.2 A higher performance alternative to the PCI bus
9.6 Potential benefits of GPU-accelerated platforms
9.6.1 Reducing time-to-solution
9.6.2 Reducing energy use with GPUs
9.6.3 Reduction in cloud computing costs with GPUs
10.1 GPU programming abstractions: A common framework
10.1.2 Inability to coordinate among tasks
10.1.3 Terminology for GPU parallelism
10.1.4 Data decomposition into independent units of work: An NDRange or grid
10.1.5 Work groups provide a right-sized chunk of work
10.1.6 Subgroups, warps, or wavefronts execute in lockstep
10.1.7 Work item: The basic unit of operation
10.1.8 SIMD or vector hardware
10.2 The code structure for the GPU programming model
10.2.1 “Me” programming: The concept of a parallel kernel
10.2.2 Thread indices: Mapping the local tile to the global world
10.2.4 How to address memory resources in your GPU programming model
10.3 Optimizing GPU resource usage
10.3.1 How many registers does my kernel use?
10.3.2 Occupancy: Making more work available for work group scheduling
10.4 Reduction pattern requires synchronization across work groups
10.5 Asynchronous computing through queues (streams)
10.6 Developing a plan to parallelize an application for GPUs
10.6.1 Case 1: 3D atmospheric simulation
10.6.2 Case 2: Unstructured mesh application
11 Directive-based GPU programming
11.1 Process to apply directives and pragmas for a GPU implementation
11.2 OpenACC: The easiest way to run on your GPU
11.2.2 Parallel compute regions in OpenACC for accelerating computations
11.2.3 Using directives to reduce data movement between the CPU and the GPU
11.2.4 Optimizing the GPU kernels
11.2.5 Summary of performance results for the stream triad
11.2.6 Advanced OpenACC techniques
11.3 OpenMP: The heavyweight champ enters the world of accelerators
11.3.2 Generating parallel work on the GPU with OpenMP
11.3.3 Creating data regions to control data movement to the GPU with OpenMP
11.3.4 Optimizing OpenMP for GPUs
11.3.5 Advanced OpenMP for GPUs
12 GPU languages: Getting down to basics
12.1 Features of a native GPU programming language
12.2 CUDA and HIP GPU languages: The low-level performance option
12.2.1 Writing and building your first CUDA application
12.2.2 A reduction kernel in CUDA: Life gets complicated
12.2.3 Hipifying the CUDA code
12.3 OpenCL for a portable open source GPU language
12.3.1 Writing and building your first OpenCL application
12.4 SYCL: An experimental C++ implementation goes mainstream
12.5 Higher-level languages for performance portability
12.5.1 Kokkos: A performance portability ecosystem
12.5.2 RAJA for a more adaptable performance portability layer
13.1 An overview of profiling tools
13.2 How to select a good workflow
13.3 Example problem: Shallow water simulation
13.4 A sample of a profiling workflow
13.4.1 Run the shallow water application
13.4.2 Profile the CPU code to develop a plan of action
13.4.3 Add OpenACC compute directives to begin the implementation step
13.4.4 Add data movement directives
13.4.5 Guided analysis can give you some suggested improvements
13.4.6 The NVIDIA Nsight suite of tools can be a powerful development aid
13.4.7 CodeXL for the AMD GPU ecosystem
13.5 Don’t get lost in the swamp: Focus on the important metrics
13.5.1 Occupancy: Is there enough work?
13.5.2 Issue efficiency: Are your warps on break too often?
13.5.3 Achieved bandwidth: It always comes down to bandwidth
13.6 Containers and virtual machines provide alternate workflows
13.6.1 Docker containers as a workaround
13.6.2 Virtual machines using VirtualBox
13.7 Cloud options: A flexible and portable capability
Part 4 High performance computing ecosystems
14 Affinity: Truce with the kernel
14.1 Why is affinity important?
14.2 Discovering your architecture
14.3 Thread affinity with OpenMP
14.4 Process affinity with MPI
14.4.1 Default process placement with OpenMPI
14.4.2 Taking control: Basic techniques for specifying process placement in OpenMPI
14.4.3 Affinity is more than just process binding: The full picture
14.5 Affinity for MPI plus OpenMP
14.6 Controlling affinity from the command line
14.6.1 Using hwloc-bind to assign affinity
14.6.2 Using likwid-pin: An affinity tool in the likwid tool suite
14.7 The future: Setting and changing affinity at run time
14.7.1 Setting affinities in your executable
14.7.2 Changing your process affinities during run time
15 Batch schedulers: Bringing order to chaos
15.1 The chaos of an unmanaged system
15.2 How not to be a nuisance when working on a busy cluster
15.2.1 Layout of a batch system for busy clusters
15.2.2 How to be courteous on busy clusters and HPC sites: Common HPC pet peeves
15.3 Submitting your first batch script
15.4 Automatic restarts for long-running jobs
15.5 Specifying dependencies in batch scripts
16 File operations for a parallel world
16.1 The components of a high-performance filesystem
16.2 Standard file operations: A parallel-to-serial interface
16.3 MPI file operations (MPI-IO) for a more parallel world
16.4 HDF5 is self-describing for better data management
16.5 Other parallel file software packages
16.6 Parallel filesystem: The hardware interface
16.6.1 Everything you wanted to know about your parallel file setup but didn’t know how to ask
16.6.2 General hints that apply to all filesystems
16.6.3 Hints specific to particular filesystems
17 Tools and resources for better code
17.1 Version control systems: It all begins here
17.1.1 Distributed version control fits the more mobile world
17.1.2 Centralized version control for simplicity and code security
17.2 Timer routines for tracking code performance
17.3 Profilers: You can’t improve what you don’t measure
17.3.1 Simple text-based profilers for everyday use
17.3.2 High-level profilers for quickly identifying bottlenecks
17.3.3 Medium-level profilers to guide your application development
17.3.4 Detailed profilers give the gory details of hardware performance
17.4 Benchmarks and mini-apps: A window into system performance
17.4.1 Benchmarks measure system performance characteristics
17.4.2 Mini-apps give the application perspective
17.5 Detecting (and fixing) memory errors for a robust application
17.5.1 Valgrind Memcheck: The open source standby
17.5.2 Dr. Memory for your memory ailments
17.5.3 Commercial memory tools for demanding applications
17.5.4 Compiler-based memory tools for convenience
17.5.5 Fence-post checkers detect out-of-bounds memory accesses
17.5.6 GPU memory tools for robust GPU applications
17.6 Thread checkers for detecting race conditions
17.6.1 Intel® Inspector: A race condition detection tool with a GUI
17.6.2 Archer: A text-based tool for detecting race conditions
17.7 Bug-busters: Debuggers to exterminate those bugs
17.7.1 TotalView debugger is widely available at HPC sites
17.7.2 DDT is another debugger widely available at HPC sites
17.7.3 Linux debuggers: Free alternatives for your local development needs
17.7.4 GPU debuggers can help crush those GPU bugs
17.8 Profiling those file operations
17.9 Package managers: Your personal system administrator
17.9.1 Package managers for macOS
17.9.2 Package managers for Windows
17.9.3 The Spack package manager: A package manager for high performance computing
17.10 Modules: Loading specialized toolchains
17.10.1 TCL modules: The original modules system for loading software toolchains
17.10.2 Lmod: A Lua-based alternative Modules implementation
17.11 Reflections and exercises
Bob Robey, Los Alamos, New Mexico
It's a dangerous business, Frodo, going out your door. You step onto the road, and if you don't keep your feet, there's no knowing where you might be swept off to.
I could not have foreseen where this journey into parallel computing would take us. “Us” because the journey has been shared by numerous colleagues over the years. My journey into parallel computing began in the early 1990s, while I was at the University of New Mexico. I had written some compressible fluid dynamics codes to model shock tube experiments and was running these on every system I could get my hands on. As a result, I, along with Brian Smith, John Sobolewski, and Frank Gilfeather, was asked to submit a proposal for a high performance computing center. We won the grant and established the Maui High Performance Computing Center in 1993. My part in the project was to offer courses and lead 20 graduate students in developing parallel computing at the University of New Mexico in Albuquerque.
The 1990s were a formative time for parallel computing. I remember a talk by Al Geist, one of the original developers of Parallel Virtual Machine (PVM) and a member of the MPI standards committee. He talked about the soon-to-be released MPI standard (June, 1994). He said it would never go anywhere because it was too complex. Al was right about the complexity, but despite that, it took off, and within months it was used by nearly every parallel application. One of the reasons for the success of MPI is that there were implementations ready to go. Argonne had been developing Chameleon, a portability tool that would translate between the message-passing languages at that time, including P4, PVM, MPL, and many others. The project was quickly changed to MPICH, which became the first high-quality MPI implementation. For over a decade, MPI became synonymous with parallel computing. Nearly every parallel application was built on top of MPI libraries.
Now let’s fast forward to 2010 and the emergence of GPUs. I came across a Dr. Dobb’s article on using a Kahan sum to compensate for the single-precision arithmetic that was the only option available on GPUs at the time. I thought that maybe the approach could help resolve a long-standing issue in parallel computing, where the global sum of an array changes depending on the number of processors. To test this out, I thought of a fluid dynamics code that my son Jon wrote in high school. He tested the mass and energy conservation in the problem over time and would stop running and exit the program if it changed more than a specified amount. While he was home over spring break from his freshman year at the University of Washington, we tried out the method and were pleasantly surprised by how much the mass conservation improved. For production codes, the impact of this simple technique would prove to be important. We cover the enhanced precision sum algorithm for parallel global sums in section 5.7 of this book.
In 2011, I organized a summer project with three students, Neal Davis, David Nicholaeff, and Dennis Trujillo, to see if we could get more complex codes like adaptive mesh refinement (AMR) and unstructured arbitrary Lagrangian-Eulerian (ALE) applications to run on a GPU. The result was CLAMR, an AMR mini-app that ran entirely on a GPU. Much of the application was easy to port. The most difficult part was determining the neighbor for each cell. The original CPU code used a k-d tree algorithm, but tree-based algorithms are difficult to port to GPUs. Two weeks into the summer project, the Las Conchas Fire erupted in the hills above Los Alamos and the town was evacuated. We left for Santa Fe, and the students scattered. During the evacuation, I met with David Nicholaeff in downtown Santa Fe to discuss the GPU port. He suggested that we try using a hash algorithm to replace the tree-based code for the neighbor finding. At the time, I was watching the fire burning above the town and wondering if it had reached my house. In spite of that, I agreed to try it, and the hashing algorithm resulted in getting the entire code running on the GPU. The hashing technique was generalized by David, my daughter Rachel while she was in high school, and myself. These hash algorithms form the basis for many of the algorithms presented in chapter 5.
In the following years, compact hashing techniques were developed by Rebecka Tumblin, Peter Ahrens, and Sara Hartse. The more difficult problem of compact hashing for remapping operations on the CPU and GPU was tackled by Gerald Collom and Colin Redman when they were just out of high school. With these breakthroughs in parallel algorithms for the GPU, the barriers to getting many scientific applications running on the GPU were toppling.
In 2016, I started the Los Alamos National Laboratory (LANL) Parallel Computing Summer Research Internship (PCSRI) program along with my co-founders, Hai Ah Nam and Gabe Rockefeller. The goal of the parallel computing program was to address the growing complexity of high-performance computing systems. The program is a 10-week summer internship with lectures on various parallel computing topics, followed by a research project mentored by staff from Los Alamos National Laboratory. We have had anywhere from 12 to 18 students participating in the summer program, and many have used it as a springboard for their careers. Through this program, we continue to tackle some of the newest challenges facing parallel and high performance computing.
If there’s a book that you want to read, but it hasn’t been written yet, then you must write it.
My introduction to parallel computing began with, “Before you start, go into the room at the end of the 4th floor and install those Knights Corner processors in our cluster.” This request from a professor at Cornell University encouraged me to try something new. What I thought would be a simple endeavor turned into a tumultuous journey into high performance computing. I started by learning the basics of how a small cluster works, from physically lifting 40-lb servers and working with the BIOS, to running my first applications and then optimizing them across the nodes I had installed.
After a short family break, daunting as it was, I applied for a research internship. Being accepted into the first Parallel Computing Summer Research Internship program in New Mexico gave me the opportunity to explore the intricacies of parallel computing on today’s hardware and that is where I met Bob. I became enthralled with the gains in performance that were possible with just some knowledge of how to properly write parallel code. I personally explored how to write more effective OpenMP code. My excitement and progress in optimization of applications opened the door to other opportunities, such as attending conferences and presenting my work at the Intel User’s Group meeting and at the Intel booth at Supercomputing. I was also invited to attend and present at the 2017 Salishan Conference. That was a great opportunity to exchange ideas with some of the leading visionaries of high performance computing.
Another great experience was applying for and participating in a GPU hackathon. At the hackathon, we ported a code over to OpenACC and, within a week, the code achieved a speedup by a factor of 60. Think of this: a calculation that previously took a month could now be done overnight. Fully diving into the potential of long-term research, I applied to graduate schools and chose the University of Chicago, in part because of its close relationship with Argonne National Laboratory. At the University of Chicago, I was advised by Ian Foster and Henry Hoffmann.
From my experiences, I realized how valuable personal interactions are to learning how to write parallel code. I was also frustrated by the lack of a textbook or reference that discusses the current hardware. To fill this gap, we have written this book to make parallel and high performance computing much easier to approach for newcomers. Taking on the challenge of creating and teaching an introduction to computer science for incoming University of Chicago students helped me gain an understanding of those new to the field. On the other hand, explaining parallel programming techniques as a teaching assistant in the Advanced Distributed Systems course allowed me to work with students with a higher level of understanding. Both of these experiences helped me attain the ability to explain complex topics at different levels.
I believe that everyone should have the opportunity to learn this important material on writing performant code and that it should be easily accessible to everyone. I was lucky enough to have mentors and advisors who steered me to the right website links or handed me their old manuscripts to read and learn from. Though some of the techniques can be difficult, the greater problem is the lack of coherent documentation or access to leading scientists in the field as mentors. I understand not everyone has the same resources, and, therefore, I hope that creating this book fills the void that currently exists.
Beginning in 2016, a team of LANL scientists led by Bob Robey developed lecture materials for the Los Alamos National Laboratory (LANL) Parallel Computing Summer Research Internship (PCSRI) program. Much of this material addresses the latest hardware quickly coming to market. Parallel computing is changing at a rapid rate, and there is little documentation to accompany it. A book covering these materials was clearly needed. It was at this point that Manning Publications contacted Bob about writing a book on parallel computing. We had a rough draft of the materials, so how hard could this be? Thus began a two-year effort to put it all into a high-quality format.
The topics and chapter outline were well-defined at an early stage, based on the lectures for our summer program. Many of the ideas and techniques are drawn from the greater high performance computing community as we strive towards an exascale level of computing—a thousand-fold improvement in computational performance over the previous Petascale milestone. This community includes the Department of Energy (DOE) Centers of Excellence, the Exascale Computing Project, and a series of Performance, Portability, and Productivity workshops. The breadth and depth of the materials in our computing lectures reflect the deep challenges of the complex heterogeneous computing architectures.
We call the material in this book “an introduction with depth.” It starts at the basics of parallel and high-performance computing, but without gaining knowledge of the computing architecture, it is not possible to achieve optimal performance. We try to give an insight into a deeper level of understanding as we go along because it is not enough to just travel along the trail without any idea of where you are or where you are going. We provide the tools to develop a map and to show how far the distance is to the goal that we are striving towards.
At the outset of this book, Joe Schoonover was tapped to write the GPU materials and Yulie Zamora the OpenMP chapter. Joe provided the design and layout of the GPU sections but had to drop out early. Yulie has written papers and given many presentations on how OpenMP fits into this brave new world of exascale computing, so she was especially well-suited to write the OpenMP chapter of the book. Yulie's deep understanding of the challenges of exascale computing and her ability to break them down for newcomers to the field have been a critical contribution to the creation of this book.
We’d like to thank all who helped shape this book. First on our list is Joe Schoonover of Fluid Numerics, who has gone above and beyond in teaching parallel computing, particularly with GPUs. Joe was one of the co-leads of our parallel computing program and instrumental in formulating what the book should cover. Our other co-leads, Hai Ah Nam, Gabe Rockefeller, Kris Garrett, Eunmo Koo, Luke Van Roekel, Robert Bird, Jonas Lippuner, and Matt Turner, have contributed to the success of the parallel computing school and its content. The founding of the parallel computing summer program would not have occurred without the support and vision of the Institute Director, Stephan Eidenbenz. Also thanks to Scott Runnels and Daniel Israel, who have led the LANL Computational Physics summer school and pioneered the school’s concept, giving us a model to follow.
We are fortunate to be surrounded by experts in parallel computing and book publishing. Thanks to Kate Bowman, whose expertise on writing helped guide the revisions of the early chapters. Kate is incredibly talented in all aspects of publishing and has been a book indexer for many years. We have also had informal reviews from Bob’s son, Jon; daughter, Rachel; and son-in-law, Bob Bird, each of whom have some of their technical work mentioned in the book. Yulie’s husband, Rick, helped provide expertise with some topics, and Dov Shlachter reviewed some early drafts and provided some helpful feedback.
We’d also like to acknowledge expertise from collaborators who found their way into specific chapters. This includes Rao Garimella and Shane Fogerty of Los Alamos and Matt Martineau of Lawrence Livermore National Laboratory whose work is incorporated in chapter 4. A special thanks goes to the innovative work of the many students mentioned earlier whose work fills much of chapter 5. Ron Green of Intel has for some years led the effort to document how to use the vectorization provided by the Intel compiler, forming the basis for chapter 6. The tsunami simulation in chapter 13 originated from the McCurdy High School team composed of Sarah Armstrong, Joseph Koby, Juan-Antonio Vigil, and Vanessa Trujillo, participating in the New Mexico Supercomputing Challenge in 2007. Also, thanks to Cristian Gomez for helping with the tsunami illustration. Work on process placement and affinity with Doug Jacobsen of Intel and Hai Ah Nam and Sam Gutiérrez of Los Alamos National Laboratory laid the foundation for chapter 14. Also, work with the Datalib team of Galen Shipman and Brad Settlemyer of Los Alamos National Laboratory, Rob Ross, Rob Latham, Phil Carns, Shane Snyder of Argonne National Laboratory, and Wei-Keng Liao of Northwestern University is reflected in chapter 16 and the section on the Darshan tool for profiling file operations in chapter 17.
We also appreciate the efforts of the Manning Publications professionals in creating a more polished and professional product. Our copy editor, Frances Buran, did a remarkable job improving the writing and making it more readable. She handled the highly technical and precise language and did it at an amazing pace. Also thanks to Deirdre Hiam, our production editor, for transforming the graphics, formulas, and text into a polished product for our readers. We would also like to thank Jason Everett, our proofreader. Paul Wells, the book’s Production Manager, kept all of this effort on a tight schedule.
Manning Publications incorporates numerous reviews into the writing process, including writing style, copyediting, proofreading, and technical content. First is the Manning Acquisitions Editor, Mike Stephens, who saw the need for a book on this topic. Our Development Editor, Marina Michaels, helped keep us on track for this huge effort. Marina was especially helpful in making the material more accessible to a general audience. Christopher Haupt as the Technical Development Editor gave us valuable feedback on the technical content. We especially thank Tuan Tran, our Technical Proofer, who reviewed the source code for all the examples. Tuan did a great job tackling the challenges of high-performance computing software and hardware configurations. Our review editor, Aleksandar Dragosavljević, recruited a great set of reviewers that spanned a broad cross-section of readers. These reviewers, Alain Couniot, Albert Choy, Alessandro Campeis, Angelo Costa, Arav Kapish Agarwal, Dana Robinson, Domingo Salazar, Hugo Durana, Jean-François Morin, Patrick Regan, Phillip G Bradford, Richard Tobias, Rob Kielty, Srdjan Santic, Tuan A. Tran, and Vincent Douwe, gave us valuable feedback that substantially improved the final product.
One of the most important tasks for an explorer is to draw a map for those who follow. This is especially true for those of us pushing the boundaries of science and technology. Our goal in this book is to provide a roadmap for those just starting to learn about parallel and high performance computing and for those who want to broaden their knowledge of the field. High performance computing is a rapidly changing field, where languages and technologies are constantly in flux. For this reason, we’ll focus on the fundamentals that stay steady over time. For the computer languages for CPUs and GPUs, we stress the common patterns across the many languages, so that you can quickly select the most appropriate language for your current task.
This book is targeted both at upper-division undergraduate parallel computing classes and at computing professionals looking for state-of-the-art literature. If you are interested in performance, whether it be run time, scale, or power, this book will give you the tools to improve your application and outperform your competition. With processors reaching the limits of scale, heat, and power, we cannot count on the next generation of computers to speed up our applications. Increasingly, highly skilled and knowledgeable programmers are critical for getting maximum performance from today’s applications.
In this book, we hope to get across key ideas that hold true for today’s high performance computing hardware. These are the basic truths of programming for performance, and these themes underlie the entire book.
In high performance computing, it is not how fast you write the code, it is how fast the code you write runs.
This one thought sums up what it means to write applications for high performance computing. For most other applications, the focus is on how fast you can write an application. Today, computer languages are typically designed to promote quicker programming rather than better-performing code. Although this programming approach has long infused high performance computing applications, it has not been widely documented or described. In chapter 4, we discuss this different focus in a programming methodology that has recently been coined data-oriented design.
It is all about memory: how much you use and how often you load it.
Available memory and memory operations are almost always the limiting factor in performance, yet we still tend to spend a lot of time thinking about floating-point operations. With most current computing hardware capable of 50 floating-point operations for every memory load, floating-point operations are a secondary concern. In almost every chapter, we use our implementation of the STREAM benchmark, a memory performance test, to verify that we are getting reasonable performance from the hardware and programming language.
If you load one value, you get eight or sixteen.
It’s like buying eggs. You can’t get just one. Memory loads are done by cache lines of 512 bits. For a double-precision value of 8 bytes, eight values will be loaded whether you want them or not. Plan your program to use more than one value, and preferably eight contiguous values, for best performance. And while you are at it, use the rest of the eggs.
If there are any flaws in your code, parallelization will expose them.
Code quality requires more attention in high performance computing than in a comparable serial application. This applies before, during, and after parallelization. With parallelization, you are more likely to trigger a flaw in your program and will also find debugging more challenging, especially at large scale. We introduce techniques for improving software quality in chapter 2; then, throughout the chapters, we mention important tools; and finally, in chapter 17, we list other tools that can prove valuable.
These key themes transcend hardware types, applying equally to all CPUs and GPUs. They exist because of the physical constraints currently imposed on the hardware.
This book does not expect that you have any knowledge of parallel programming. It does expect that readers are proficient programmers, preferably in a compiled high performance computing language such as C, C++, or Fortran. It is also expected that readers have some knowledge of computing terminology, operating system basics, and networking. Readers should also be able to find their way around their computer, including installing software and performing light system administration tasks.
The knowledge of computing hardware is perhaps the most important requirement for readers. We recommend opening up your computer, looking at each component, and getting an understanding of its physical characteristics. If you cannot open up your computer, see the photos of a typical desktop system at the end of appendix C. For example, look at the bottom of the CPU in figure C.2 and at the forest of pins going into the chip. Can you fit any more pins there? Now you can see why there is a physical limit to how much data can be transferred to the CPU from other parts of the system. Flip back to these photos and the glossary in appendix A when you need to have a better understanding of the computing hardware or computing terminology.
We have divided this book into four parts that comprise the world of high performance computing. These are
Part 1: Introduction to parallel computing (chapters 1-5)
Part 2: Central processing unit (CPU) technologies (chapters 6-8)
Part 3: Graphics processing unit (GPU) technologies (chapters 9-13)
Part 4: High performance computing (HPC) ecosystems (chapters 14-17)
The order of topics is oriented towards someone tackling a high performance computing project. For example, for an application project, the software engineering topics in chapter 2 are necessary before starting a project. Once the software engineering is in place, the next decisions are the data structures and algorithms. Then come the implementations for the CPU and GPU. Finally, the application is adapted for the parallel file system and other unique characteristics of a high performance computing system.
On the other hand, some of our readers are more interested in gaining fundamental skills in parallel programming and might want to go directly into the MPI or OpenMP chapters. But don’t stop there. Today, there is so much more to parallel computing. From GPUs that can speed up your application another order of magnitude to tools that can improve your code quality or point out sections of code to optimize—the potential gains are only limited by your time and expertise.
If you are using this book for a class on parallel computing, the scope of the material is sufficient for at least two semesters. You might think of the book as a buffet of materials that can be individualized to the audience. By selecting the topics to cover, you can customize it for your own course objectives. Here is a possible sequence of material:
Chapter 3 covers approaches to measuring hardware and application performance
Sections 4.1-4.2 describe the data-oriented design concept of programming, multi-dimensional arrays, and cache basics
Chapter 7 covers OpenMP (Open Multi-Processing) to get on-node parallelism
Chapter 8 covers MPI (Message Passing Interface) to get distributed parallelism across multiple nodes
Sections 14.1-14.5 introduce affinity and process placement concepts
Chapters 9 and 10 describe GPU hardware and programming models
Sections 11.1-11.2 focus on OpenACC to get applications running on the GPU
You can add topics such as algorithms, vectorization, parallel file handling, or more GPU languages to this list. Or you can remove a topic so that you can spend more time on the remaining topics. There still are additional chapters to tempt students to continue to explore the world of parallel computing on their own.
You cannot learn parallel computing without actually writing code and running it. For this purpose, we provide a large set of examples that accompanies the book. The examples are freely available at https://github.com/EssentialsOfParallelComputing. You can download these examples either as a complete set or individually by chapter.
Given the scope of the examples, hardware, and software, there will inevitably be flaws and errors in the accompanying examples. If you find something that is in error or just not complete, we encourage contributions to the examples. We have already merged some change requests from readers, which were greatly appreciated. Additionally, the source code repository will be the best place to look for corrections and source code discussions.
Perhaps the biggest challenge in parallel and high performance computing is the wide range of hardware and software that is involved. In the past, these specialized systems were only available at specific sites. Recently, the hardware and software have become more democratized and widely available, even at the desktop or laptop level. This is a substantial shift that can make software for high performance computing much easier to develop. However, the setup of the hardware and software environment is the most difficult part of the task. If you have access to a parallel computing cluster where these are already set up, we encourage you to take advantage of it. Eventually, you may want to set up your own system. The examples are easiest to use on a Linux or Unix system but, in many cases, should also work on Windows and macOS with some additional effort. We have provided alternatives with Docker container templates and VirtualBox setup scripts for when you find that an example doesn’t run on your system.
The GPU exercises require GPUs from different vendors, including NVIDIA, AMD Radeon, and Intel. Anyone who has struggled to get GPU graphics drivers installed on their system will not be surprised that these present the greatest difficulty in setting up your local system for the examples. Some of the GPU languages can also run on the CPU, allowing the development of code on a local system for hardware that you do not have. You may also find that debugging on the CPU is easier. But to see the actual performance, you will have to have the actual GPU hardware.
Other examples requiring special installation include the batch system and the parallel file examples. A batch system requires more than a single laptop or workstation to set it up to look like a real installation. Similarly, the parallel file examples work best with a specialized filesystem like Lustre, though the basic examples will work on a laptop or workstation.
Purchase of Parallel and High Performance Computing includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum, go to https://livebook.manning.com/#!/book/parallel-and-high-performance-computing/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/#!/discussion. Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
Manning Publications also provides an online discussion forum called livebook for each book. Our site is at https://livebook.manning.com/book/parallel-and-high-performance-computing. This is a good place to add comments or expand on the materials in the chapters.
The figure on the cover of Parallel and High Performance Computing is captioned “M’de de brosses à Vienne,” or “Seller of brushes in Vienna.” The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757-1810), titled Costumes de Différents Pays, published in France in 1797. Each illustration is finely drawn and colored by hand. The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify where they lived and what their trade or station in life was just by their dress.
The way we dress has changed since then and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.
At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.
Robert (Bob) Robey is a technical staff scientist in the Computational Physics Division at Los Alamos National Laboratory and an adjunct researcher at the University of New Mexico. He is a founder of the Parallel Computing Summer Research Internships started in 2016. He is a member of the NSF/IEEE-TCPP Curriculum Initiative on Parallel and Distributed Computing. Bob is a board member of the New Mexico Supercomputing Challenge, a high school and middle school educational program in its 30th year. He has mentored hundreds of students over the years and has twice been recognized as a Los Alamos Distinguished Student Mentor. Bob co-taught a parallel computing class at University of New Mexico and has given guest lectures at other universities.
Bob began his scientific career by operating explosive-driven and compressible gas-driven shock tubes at the University of New Mexico. These included the largest explosive-driven shock tube in the world, at 20 feet in diameter and over 800 feet long. He conducted hundreds of experiments with explosions and shock waves. To support his experimental work, Bob has written several compressible fluid dynamics codes since the early 1990s and has authored many articles in international journals and publications. Full 3D simulations were a rarity at the time, stressing compute resources to the limit. The search for more compute resources led to his involvement in high performance computing research.
Bob worked 12 years at the University of New Mexico conducting experiments, writing and running compressible fluid dynamics simulations, and starting a high performance computing center. He was a lead proposal writer and brought tens of millions of dollars of research grants to the university. Since 1998, he has held a position at Los Alamos National Laboratory, where he has contributed to large multi-physics codes running on a variety of the latest hardware.
Bob is a world-class kayaker with first descents down previously unrun rivers in Mexico and New Mexico. He is also a mountaineer with ascents of peaks on three continents up to over 18,000 feet in elevation. He is a leader in the co-ed Los Alamos Venture crew and helps out with multi-day trips down western rivers.
Bob is a graduate of Texas A&M University with a Master’s in Business Administration and a Bachelor’s degree in Mechanical Engineering. He has taken graduate coursework in the Mathematics Department at the University of New Mexico.
Yuliana (Yulie) Zamora is completing her PhD in Computer Science at the University of Chicago. Yulie is a 2017 fellow at the CERES Center for Unstoppable Computing at the University of Chicago and a National Physical Science Consortium (NPSC) graduate fellow.
Yulie has worked at the Los Alamos National Laboratory and interned at Argonne National Laboratory. At Los Alamos National Laboratory, she optimized the Higrad Firetec code used for simulating wildland fires and other atmospheric physics for some of the top high performance computing systems. At Argonne National Laboratory, she worked at the intersection of high performance computing and machine learning. She has worked on projects ranging from performance prediction on NVIDIA GPUs to machine learning surrogate models for scientific applications.
Yulie developed and taught an Introduction to Computer Science course for incoming University of Chicago students. She incorporated many of the basic concepts of parallel computing into the material. The course was so successful that she was asked to teach it again and again. Wanting to gain more teaching experience, she volunteered for a teaching assistant position in an Advanced Distributed Systems course at the University of Chicago.
Yulie’s Bachelor’s degree is in Civil Engineering from Cornell University. She finished her Master’s in Computer Science at the University of Chicago and will soon complete her PhD in Computer Science, also from the University of Chicago.
The first part of this book covers topics of general importance to parallel computing. These topics include
While these topics should be considered first by a parallel programmer, these will not have the same importance to all readers of this book. For the parallel application developer, all of the chapters in this part address upfront concerns for a successful project. A project needs to select the right hardware, the right type of parallelism, and the right kind of expectations. You should determine the appropriate data structures and algorithms before starting your parallelization efforts; it’s much harder to change these later.
Even if you are a parallel application developer, you may not need the full depth of material discussed. Those desiring only modest parallelism or serving a particular role on a team of developers might find a cursory understanding of the content sufficient. If you just want to explore parallel computing, we suggest reading chapter 1 and chapter 5, then skimming the others to get the terminology that is used in discussing parallel computing.
We include chapter 2 for those who may not have a software engineering background or for those who just need a refresher. If you are new to all of the details of CPU hardware, then you may need to read chapter 3 in small increments. An understanding of the current computing hardware and your application is important in extracting performance, but it doesn’t have to come all at once. Be sure to return to chapter 3 when you are ready to purchase your next computing system so you can cut through all the marketing claims to what is really important for your application.
The discussion of data design and performance modeling in chapter 4 can be challenging because it requires an understanding of hardware details, their performance, and compilers to fully appreciate. Although it’s an important topic due to the impact the cache and compiler optimizations have on performance, it’s not necessary for writing a simple parallel program.
We encourage you to follow along with the accompanying examples for the book. You should spend some time exploring the many examples that are available in these software repositories at https://github.com/EssentialsOfParallelComputing.
The examples are organized by chapter and include detailed setup information for various hardware and operating systems. To help with portability issues, there are sample container builds for Ubuntu distributions in Docker, as well as instructions for setting up a virtual machine through VirtualBox. If you need to set up your own system, you may want to read the section on Docker and virtual machines in chapter 13. Be aware, however, that containers and virtual machines come with restricted environments that are not always easy to work around.
Work is ongoing to get the container builds and other system environment setups functioning properly across the many possible system configurations. Getting the system software installed correctly, especially the GPU driver and associated software, is the most challenging part of the journey. The wide variety of operating systems, hardware (including graphics processing units, or GPUs), and the often-overlooked quality of installation software make this a difficult task. One alternative is to use a cluster where the software is already installed. Still, it is helpful at some point to install some software on your laptop or desktop for a more convenient development resource. Now it is time to turn the page and enter the world of parallel computing. It is a world of nearly unlimited performance and potential.
In today's world, you'll find many challenges requiring extensive and efficient use of computing resources. Traditionally, most applications requiring performance have been in the scientific domain. But artificial intelligence (AI) and machine learning applications are projected to become the predominant users of large-scale computing. Some examples of these applications include
Modeling megafires to assist fire crews and to help the public
Modeling tsunamis and storm surges from hurricanes (see chapter 13 for a simple tsunami model)
Equipping emergency crews with running simulations of hazards such as flooding
With the techniques covered in this book, you will be able to handle larger problems and datasets, while also running simulations ten, a hundred, or even a thousand times faster. Typical applications leave much of the compute capability of today’s computers untapped. Parallel computing is the key to unlocking the potential of your computer resources. So what is parallel computing and how can you use it to supercharge your applications?
Parallel computing is the execution of many operations at a single instance in time. Fully exploiting parallel computing does not happen automatically. It requires some effort from the programmer. First, you must identify and expose the potential for parallelism in an application. Potential parallelism, or concurrency, means that you certify that it is safe to conduct operations in any order as the system resources become available. And, with parallel computing, there is an additional requirement: these operations must occur at the same time. For this to happen, you must also properly leverage the resources to execute these operations simultaneously.
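As a minimal sketch of these two steps — certifying that operations are safe in any order, then actually executing them at the same time — consider the following Python example. It is illustrative only (not from the book's example repositories); `process_item` is a hypothetical stand-in for any independent unit of work.

```python
from concurrent.futures import ThreadPoolExecutor

def process_item(x):
    # Each call is independent of the others, so the calls are safe to
    # run in any order -- this is the concurrency we have exposed.
    return x * x

items = list(range(8))

# Serial execution: one operation at a time, in order.
serial = [process_item(x) for x in items]

# Parallel execution: the same independent operations handed to a pool
# of workers. (For CPU-bound Python code you would typically use a
# process pool instead, to sidestep the global interpreter lock.)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(process_item, items))

assert serial == parallel  # map preserves the order of results
```

Note that exposing the concurrency (the independent calls) and leveraging resources (the worker pool) are separate decisions, exactly as described above.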
Parallel computing introduces new concerns that are not present in a serial world. We need to change our thought processes to adapt to the additional complexities of parallel execution, but with practice, this becomes second nature. This book begins your discovery of how to harness the power of parallel computing.
Life presents numerous examples of parallel processing, and these instances often become the basis for computing strategies. Figure 1.1 shows a supermarket checkout line, where the goal is to have customers quickly pay for the items they want to purchase. This can be done by employing multiple cashiers to process, or check out, the customers one at a time. In this case, the skilled cashiers can more quickly execute the checkout process so customers leave faster. Another strategy is to employ many self-checkout stations and allow customers to execute the process on their own. This strategy requires fewer human resources from the supermarket and can open more lanes to process customers. Customers may not be able to check themselves out as efficiently as a trained cashier, but perhaps more customers can check out quickly due to increased parallelism resulting in shorter lines.
We solve computational problems by developing algorithms: a set of steps to achieve a desired result. In the supermarket analogy, the process of checking out is the algorithm. In this case, it includes unloading items from a basket, scanning the items to obtain a price, and paying for the items. This algorithm is sequential (or serial); it must follow this order. If there are hundreds of customers who need to execute this task, the algorithm for checking out many customers contains parallelism that can be taken advantage of. Theoretically, there is no dependency between any two customers going through the checkout process. By using multiple checkout lines or self-checkout stations, supermarkets expose parallelism, thereby increasing the rate at which customers buy goods and leave the store. Each choice in how we implement this parallelism comes with different costs and benefits.
Figure 1.1 Everyday parallelism in supermarket checkout queues. The checkout cashiers (with caps) process their queue of customers (with baskets). On the left, one cashier processes four self-checkout lanes simultaneously. On the right, one cashier is required for each checkout lane. Each option impacts the supermarket’s costs and checkout rates.
Definition Parallel computing is the practice of identifying and exposing parallelism in algorithms, expressing this in our software, and understanding the costs, benefits, and limitations of the chosen implementation.
In the end, parallel computing is about performance. This includes more than just speed, but also the size of the problem and energy efficiency. Our goal in this book is to give you an understanding of the breadth of the current parallel computing field and familiarize you with enough of the most commonly used languages, techniques, and tools so that you can tackle a parallel computing project with confidence. Important decisions about how to incorporate parallelism are often made at the outset of a project. A reasoned design is an important step toward success. Avoiding the design step can lead to problems much later. It is equally important to keep expectations realistic and to know both the available resources and the nature of the project.
Another goal of this chapter is to introduce the terminology used in parallel computing. One place to start is the glossary in appendix C, which provides a quick reference on terminology as you read this book. Because this field and its technology have grown incrementally, the use of many of these terms by those in the parallel community is often sloppy and imprecise. With the increased complexity of hardware and of parallelism within applications, it is important that we establish a clear, unambiguous use of terminology from the start.
Welcome to the world of parallel computing! As you delve deeper, the techniques and approaches become more natural, and you’ll find its power captivating. Problems that you never thought to attempt become commonplace.
The future is parallel. The increase in serial performance has plateaued as processor designs have hit the limits of miniaturization, clock frequency, power, and even heat. Figure 1.2 shows the trends in clock frequency (the rate at which an instruction can be executed), power consumption, the number of computational cores (or cores for short), and hardware performance over time for commodity processors.
Figure 1.2 Single-thread performance, CPU clock frequency (MHz), CPU power consumption (watts), and the number of CPU cores from 1970 to 2018. The parallel computing era begins about 2005, when the core count in CPU chips begins to rise while the clock frequency and power consumption plateau, yet performance steadily increases (Horowitz et al. and Rupp, https://github.com/karlrupp/microprocessor-trend-data).
In 2005, the number of cores abruptly increased from a single core to multiple cores. At the same time, clock frequency and power consumption flattened out. Theoretical performance steadily increased because performance is proportional to the product of clock frequency and the number of cores. This shift toward increasing the core count rather than the clock speed means that the best performance of a central processing unit (CPU) can only be attained through parallel computing.
Modern consumer-grade computing hardware comes equipped with multiple central processing units (CPUs) and/or graphics processing units (GPUs) that process multiple instruction sets simultaneously. These smaller systems often rival the computing power of supercomputers of two decades ago. Making full use of compute resources (on laptops, workstations, smart phones, and so forth) requires you, the programmer, to have a working knowledge of the tools available for writing parallel applications. You must also understand the hardware features that boost parallelism.
The many different parallel hardware features present new complexities to the programmer. One of these features is hyperthreading, introduced by Intel. By interleaving work from two instruction queues to the hardware logic units, a single physical core appears as two cores to the operating system (OS). Vector processors are another hardware feature that began appearing in commodity processors around 2000. These perform multiple operations with a single instruction. The width in bits of the vector processor (also called a vector unit) determines how many operations can be performed simultaneously. Thus, a 256-bit-wide vector unit can operate on four 64-bit (double-precision) or eight 32-bit (single-precision) values at one time.
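The lane arithmetic above is just the vector width divided by the element width, as this small illustrative Python sketch shows (the function name is our own, not from the book):

```python
def vector_lanes(vector_width_bits, element_width_bits):
    """Number of elements a vector unit can process per instruction."""
    return vector_width_bits // element_width_bits

# A 256-bit-wide vector unit (AVX2-class hardware, for example):
assert vector_lanes(256, 64) == 4   # four double-precision values
assert vector_lanes(256, 32) == 8   # eight single-precision values

# A 512-bit unit (AVX-512-class) doubles the lane count:
assert vector_lanes(512, 64) == 8
```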
Some improvement in software development tools has helped to add parallelism to our toolkits, and currently, the research community is doing more, but it is still a long way from addressing the performance gap. This puts a lot of the burden on us, the software developers, to get the most from a new generation of processors.
Unfortunately, software developers have lagged in adapting to this fundamental change in computing power. Further, transitioning current applications to make use of modern parallel architectures can be daunting due to the explosion of new programming languages and application programming interfaces (APIs). But a good working knowledge of your application, an ability to see and expose parallelism, and a solid understanding of the tools available can result in substantial benefits. Exactly what kind of benefits would applications see? Let’s take a closer look.
Parallel computing can reduce your time to solution, increase the energy efficiency of your application, and enable you to tackle larger problems on currently existing hardware. Today, parallel computing is no longer the sole domain of the largest computing systems. The technology is now present in everyone's desktop or laptop, and even on handheld devices. This makes it possible for every software developer to create parallel software on their local systems, thereby greatly expanding the opportunities for new applications.
Cutting-edge research from both industry and academia reveals new areas for parallel computing as interest broadens from scientific computing into machine learning, big data, computer graphics, and consumer applications. The emergence of new technologies such as self-driving cars, computer vision, voice recognition, and AI requires large computational capabilities both within the consumer device and in the development sphere, where massive training datasets must be consumed and processed. And in scientific computing, long the exclusive domain of parallel computing, there are also new, exciting possibilities: the proliferation of remote sensors and handheld devices provides more extensive data that can feed into larger, more realistic computations, better informing decision-making around natural and man-made disasters.
Remember that parallel computing itself is not the goal. Rather, the goals are what results from parallel computing: reducing run time, performing larger calculations, or decreasing energy consumption.
Faster run time with more compute cores
Reduction of an application’s run time, or the speedup, is often thought to be the primary goal of parallel computing. Indeed, this is usually its biggest impact. Parallel computing can speed up intensive calculations, multimedia processing, and big data operations, whether your applications take days or even weeks to process or the results are needed in real-time now.
In the past, the programmer would spend greater efforts on serial optimization to squeeze out a few percentage improvements. Now, there is the potential for orders of magnitude of improvement with multiple avenues to choose from. This creates a new problem in exploring the possible parallel paradigms—more opportunities than programming manpower. But, a thorough knowledge of your application and an awareness of parallelism opportunities can lead you down a clearer path towards reducing your application’s run time.
Larger problem sizes with more compute nodes
By exposing parallelism in your application, you can scale up your problem's size to dimensions that were out of reach for a serial application. This is because the amount of compute resources dictates what can be done, and exposing parallelism allows you to operate on larger resources, presenting opportunities that were never considered before. The larger sizes are enabled by greater amounts of main memory, disk storage, bandwidth over networks and to disk, and CPUs. In the supermarket analogy mentioned earlier, exposing parallelism is equivalent to employing more cashiers or opening more self-checkout lanes to handle a larger and growing number of customers.
Energy efficiency by doing more with less
One of the new impact areas of parallel computing is energy efficiency. With the emergence of parallel resources in handheld devices, parallelism can speed up applications. This allows the device to return to sleep mode sooner and permits the use of slower, but more parallel processors that consume less power. Thus, moving heavy-weight multimedia applications to run on GPUs can have a more dramatic effect on energy efficiency while also resulting in vastly improved performance. The net result of employing parallelism reduces power consumption and extends battery life, which is a strong competitive advantage in this market niche.
Another area where energy efficiency is important is with remote sensors, network devices, and operational field-deployed devices, such as remote weather stations. Often lacking large power supplies, these devices must be able to function in small packages with few resources. Parallelism expands what can be done on these devices and offloads work from the central computing system in a growing trend called edge computing. Moving computation to the very edge of the network enables processing at the source of the data, condensing it into a smaller result set that can be sent over the network more easily.
Accurately calculating the energy costs of an application is challenging without direct measurements of power usage. However, you can estimate the cost by multiplying the manufacturer’s thermal design power by the application’s run time and the number of processors used. Thermal design power is the rate at which energy is expended under typical operational loads. The energy consumption for your application can be estimated using the formula
P = (N processors) × (R watts/processor) × (T hours)
where P is the energy consumption, N is the number of processors, R is the thermal design power, and T is the application run time.
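The estimate is a straight multiplication, as in this small Python sketch (the function name and the 16-processor example are our own, purely hypothetical numbers):

```python
def energy_kwh(n_processors, tdp_watts, run_hours):
    """Estimate energy use: P = N processors x R watts/processor x T hours.

    Returns kilowatt-hours. Thermal design power (TDP) is only a rough
    proxy for actual power draw, so treat the result as an
    order-of-magnitude estimate, not a measurement.
    """
    return n_processors * tdp_watts * run_hours / 1000.0

# Hypothetical run: 16 processors with a 120 W TDP running for 24 hours.
print(energy_kwh(16, 120.0, 24.0))  # 46.08 kWh
```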
Achieving a reduction in energy cost through accelerator devices like GPUs requires that the application has sufficient parallelism that can be exposed. This permits the efficient use of the resources on the device.
Parallel computing can reduce costs
Actual monetary cost is becoming a more visible concern for software developer teams, software users, and researchers alike. As the size of applications and systems grows, we need to perform a cost-benefit analysis on the resources available. For example, with the next large High Performance Computing (HPC) systems, the power costs are projected to be three times the cost of the hardware acquisition.
Usage costs have also promoted cloud computing as an alternative, which is being increasingly adopted across academia, start-ups, and industries. In general, cloud providers bill by the type and quantity of resources used and the amount of time spent using these. Although GPUs are generally more expensive than CPUs per unit time, some applications can leverage GPU accelerators such that there are sufficient reductions in run time relative to the CPU expense to yield lower costs.
Parallel computing is not a panacea. Many applications are neither large enough nor long-running enough to need parallel computing. Some may not even have enough inherent parallelism to exploit. Also, transitioning applications to leverage multi-core and many-core (GPU) hardware requires a dedicated effort that can temporarily shift attention away from direct research or product goals. The investment of time and effort must first be deemed worthwhile. It is always more important that the application runs and generates the desired result before making it fast and scaling it up to larger problems.
We strongly recommend that you start your parallel computing project with a plan. It’s important to know what options are available for accelerating the application, then select the most appropriate for your project. After that, it is crucial to have a reasonable estimate of the effort involved and the potential payoffs (in terms of dollar cost, energy consumption, time to solution, and other metrics that can be important). In this chapter, we begin to give you the knowledge and skills to make decisions on parallel computing projects up front.
In serial computing, all operations speed up as the clock frequency increases. In contrast, with parallel computing, we need to give some thought and modify our applications to fully exploit parallel hardware. Why is the amount of parallelism important? To understand this, let’s take a look at the parallel computing laws.
We need a way to calculate the potential speedup of a calculation based on the amount of code that is parallel. This can be done using Amdahl's law, proposed by Gene Amdahl in 1967. This law describes the speedup of a fixed-size problem as the number of processors increases. The following equation shows this, where P is the parallel fraction of the code, S is the serial fraction (so that P + S = 1), and N is the number of processors:

SpeedUp(N) = 1 / (S + P/N)
Amdahl’s Law highlights that no matter how fast we make the parallel part of the code, we will always be limited by the serial portion. Figure 1.3 visualizes this limitation. This scaling of a fixed-size problem is referred to as strong scaling.
Figure 1.3 Speedup for a fixed-size problem according to Amdahl's law is shown as a function of the number of processors. Lines show ideal speedup when 100% of an algorithm is parallelized, and for 90%, 75%, and 50%. Amdahl's law states that speedup is limited by the fraction of the code that remains serial.
Definition Strong scaling represents the time to solution with respect to the number of processors for a fixed total size.
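A short Python sketch makes Amdahl's serial-fraction limit concrete (the function is illustrative, not from the book's repositories):

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup of a fixed-size problem.

    p: parallel fraction of the code (serial fraction is s = 1 - p)
    n: number of processors
    """
    s = 1.0 - p
    return 1.0 / (s + p / n)

# A 90%-parallel code can never exceed 10x speedup, no matter how
# many processors are added, because the 10% serial part remains.
print(round(amdahl_speedup(0.90, 16), 2))    # 6.4
print(round(amdahl_speedup(0.90, 1024), 2))  # 9.91
```

Trying larger and larger `n` shows the curve flattening toward the 1/S ceiling, exactly as figure 1.3 depicts.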
Gustafson and Barsis pointed out in 1988 that parallel code runs should grow the size of the problem as more processors are added. This gives us an alternate way to calculate the potential speedup of our application. If the size of the problem grows proportionally to the number of processors, the speedup is now expressed as

SpeedUp(N) = N − S × (N − 1)
where N is the number of processors, and S is the serial fraction as before. The result is that a larger problem can be solved in the same time by using more processors. This provides additional opportunities to exploit parallelism. Indeed, growing the size of the problem with the number of processors makes sense because the application user wants to benefit from more than just the power of the additional processor and wants to use the additional memory. The run-time scaling for this scenario, shown in figure 1.4, is called weak scaling.
Figure 1.4 Speedup for when the size of a problem grows with the number of available processors according to Gustafson-Barsis’s Law is shown as a function of the number of processors. Lines show ideal speedup when 100% of an algorithm is parallelized, and for 90%, 75%, and 50%.
Definition Weak scaling represents the time to solution with respect to the number of processors for a fixed-size problem per processor.
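For comparison with the Amdahl case, here is an illustrative Python sketch of the Gustafson-Barsis scaled speedup (again our own helper, not the book's code):

```python
def gustafson_speedup(s, n):
    """Gustafson-Barsis's law: speedup when the problem grows with n.

    s: serial fraction of the code
    n: number of processors
    """
    return n - s * (n - 1)

# With a 10% serial fraction, scaled speedup grows nearly linearly
# with the processor count rather than hitting a hard ceiling:
print(round(gustafson_speedup(0.10, 16), 1))    # 14.5
print(round(gustafson_speedup(0.10, 1024), 1))  # 921.7
```

Contrasting this with the fixed-size Amdahl result shows why the weak-scaling view is more optimistic: the growing problem keeps the added processors busy.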
Figure 1.5 shows the difference between strong and weak scaling in a visual representation. The weak scaling argument that the mesh size should stay constant on each processor makes good use of the resources of the additional processor. The strong scaling perspective is primarily concerned with speedup of the calculation. In practice, both strong scaling and weak scaling are important because these address different user scenarios.
Figure 1.5 Strong scaling keeps the same overall size of a problem and splits it across additional processors. In weak scaling, the size of the mesh stays the same for each processor and the total size increases.
The term scalability is often used to refer to whether more parallelism can be added in either the hardware or the software and whether there is an overall limit to how much improvement can occur. While the traditional focus is on the run-time scaling, we will make the argument that memory scaling is often more important.
Figure 1.6 shows an application with limited memory scalability. A replicated array (R) is a dataset that is duplicated across all the processors. A distributed array (D) is partitioned and split across the processors. For example, in a game simulation, 100 characters can be distributed across 4 processors with 25 characters on each processor. But the map of the game board might be copied to every processor. In figure 1.6, the replicated array is duplicated across the mesh. Because this figure is for weak scaling, the problem size grows as the number of processors increases. For 4 processors, the array is 4 times as large on each processor. As the number of processors and the size of the problem grows, soon there is not enough memory on a processor for the job to run. Limited run-time scaling means the job runs slowly; limited memory scaling means the job can’t run at all. It is also the case that if the application’s memory can be distributed, the run time usually scales as well. The reverse, however, is not necessarily true.
Figure 1.6 Distributed arrays stay the same size as the problem and number of processors doubles (weak scaling). But replicated (copied) arrays need all the data on each processor, and memory grows rapidly with the number of processors. Even if the run time weakly scales (stays constant), the memory requirements limit scalability.
One view of a computationally intensive job is that every byte of memory gets touched in every cycle of processing, and run time is a function of memory size. Reducing memory size will necessarily reduce run time. The initial focus in parallelism should thus be to reduce the memory size as the number of processors grows.
Parallel computing requires combining an understanding of hardware, software, and parallelism to develop an application. It is more than just message passing or threading. Current hardware and software give many different options to bring parallelization to your application. Some of these options can be combined to yield even greater efficiency and speedup.
It is important to have an understanding of the parallelization in your application and the way different hardware components allow you to expose it. Further, developers need to recognize that between your source code and the hardware, your application must traverse additional layers, including a compiler and an OS (figure 1.7).
Figure 1.7 Parallelization is expressed in an application software layer that gets mapped to the computer hardware through the compiler and the OS.
As a developer, you are responsible for the application software layer, which includes your source code. In the source code, you make choices about the programming language and parallel software interfaces you use to leverage the underlying hardware. Additionally, you decide how to break up your work into parallel units. A compiler is designed to translate your source code into a form the hardware can execute. With these instructions in hand, the OS manages their execution on the computer hardware.
We will show you, with an example, how to introduce parallelization to an algorithm through a prototype application. This process takes place in the application software layer but requires an understanding of the computer hardware. For now, we’ll refrain from discussing the choice of compiler and OS. We will incrementally add each layer of parallelization so that you can see how this works. With each parallel strategy, we will explain how the available hardware influences the choices that are made. The purpose in doing this is to demonstrate how hardware features influence parallel strategies. We categorize the parallel approaches a developer can take into process-based parallelization, thread-based parallelization, vectorization, and stream processing.
Following the example, we will introduce a model to help you think about modern hardware. This model breaks down modern compute hardware into individual components and the variety of compute devices. A simplified view of memory is included in this chapter. A more detailed look at the memory hierarchy is presented in chapters 3 and 4. Finally, we will discuss in more detail the application and software layers.
As mentioned, we categorize the parallel approaches a developer can take into process-based parallelization, thread-based parallelization, vectorization, and stream processing. Parallelization based on individual processes with their own memory spaces can be distributed memory on different nodes of a computer or within a node. Stream processing is generally associated with GPUs. The model for modern hardware and application software will help you better understand how to plan to port your application to current parallel hardware.
For this introduction to parallelization, we will look at a data parallel approach. This is one of the most common parallel computing application strategies. We’ll perform the computation on a spatial mesh composed of a regular two-dimensional (2D) grid of rectangular elements or cells. The steps (summarized here and described in detail later) to create the spatial mesh and prepare for the calculation are
Discretize (break up) the problem into smaller cells or elements
Define a computational kernel (operation) to conduct on each element of the mesh
Add the following layers of parallelization on CPUs and GPUs to perform the calculation:
We start with a 2D problem domain of a region of space. For purposes of illustration, we will use a 2D image of the Krakatau volcano (figure 1.8) as our example. The goal of our calculation could be to model the volcanic plume, the resulting tsunami, or the early detection of a volcanic eruption using machine learning. For all of these options, calculation speed is critical if we want real-time results to inform our decisions.
Figure 1.8 An example 2D spatial domain for a numerical simulation. Numerical simulations typically involve stencil operations (see figure 1.11) or large matrix-vector systems. These types of operations are often used in fluids modeling to yield predictions of tsunami arrival times, weather forecasts, smoke plume spreading, and other processes necessary for informed decisions.
Step 1: Discretize the problem into smaller cells or elements
For any detailed calculation, we must first break up the domain of the problem into smaller pieces (figure 1.9), a process called discretization. In image processing, these pieces are often just the pixels in a bitmap image. For a computational domain, they are called cells or elements. The collection of cells or elements forms a computational mesh that covers the spatial region for the simulation. Data values for each cell might be integers, floats, or doubles.
Figure 1.9 The domain is discretized into cells. For each cell in the computational domain, properties such as wave height, fluid velocity, or smoke density are solved for according to physical laws. Ultimately, a stencil operation or a matrix-vector system represents this discrete scheme.
Step 2: Define a computational kernel, or operation, to conduct on each element of the mesh
The calculations on this discretized data are often some form of a stencil operation, so-called because it involves a pattern of adjacent cells to calculate the new value for each cell. This can be an average (a blur operation, which blurs the image or makes it fuzzier), a gradient (edge-detection, which sharpens the edges in the image), or another more complex operation associated with solving physical systems described by partial differential equations (PDEs). Figure 1.10 shows a stencil operation as a five-point stencil that performs a blur operation by using a weighted average of the stencil values.
Figure 1.10 A five-point stencil operator as a cross pattern on the computational mesh. The data marked by the stencil are read in the operation and stored in the center cell. This pattern is repeated for every cell. The blur operator, one of the simpler stencil operators, is a weighted sum of the five points marked with the large dots and updates a value at the central point of the stencil. This type of operation is done for smoothing operations or wave propagation numerical simulations.
But what are these partial differential equations? Let’s go back to our example and imagine this time it is a color image composed of separate red, green, and blue arrays to make an RGB color model. The term “partial” here means that there is more than one variable and that we are separating out the change of red with space and time from that of green and blue. Then we carry out the blur operator separately on each of these colors.
There is one more requirement: we need to apply a rate of change with time and space. In other words, the red would spread at one rate and green and blue at others. This could be to produce a special effect on an image, or it can describe how real colors bleed and merge in a photographic image during development. In the scientific world, instead of red, green, and blue, we might have mass and x and y velocity. With the addition of a little more physics, we might have the motion of a wave or an ash plume.
Step 3: Vectorization to work on more than one unit of data at a time
We start introducing parallelization by looking at vectorization. What is vectorization? Some processors have the ability to operate on more than one piece of data at a time, a capability referred to as vector operations. The shaded blocks in figure 1.11 illustrate how multiple data values are operated on simultaneously in a vector unit of the processor with one instruction in one clock cycle.
Figure 1.11 A special vector operation is conducted on four doubles. This operation can be executed in a single clock cycle with little additional energy cost relative to the serial operation.
Step 4: Threads to deploy more than one compute pathway to engage more processing cores
Because most CPUs today have at least four processing cores, we use threading to operate the cores simultaneously across four rows at a time. Figure 1.12 shows this process.
Figure 1.12 Four threads process four rows of vector units simultaneously.
Step 5: Processes to spread out the calculation to separate memory spaces
We can further split the work between processors on two desktops, often called nodes in parallel processing. When the work is split across nodes, the memory spaces for each node are distinct and separate. This is indicated by putting a gap between the rows as in figure 1.13.
Figure 1.13 This algorithm can be parallelized further by distributing the 4×4 blocks among distinct processes. Each process uses four threads, each handling a four-node-wide vector unit in a single clock cycle. Additional white space in the figure illustrates the process boundaries.
Even for this fairly modest hardware scenario, there is a potential speedup of 32x. This is shown by the following:
2 desktops (nodes) × 4 cores × (256 bit-wide vector unit)/(64-bit double) = 32x potential speedup
If we look at a high-end cluster with 16 nodes, 36 cores per node, and a 512-bit vector processor, the potential theoretical speedup is 4,608 times faster than a serial process:
16 nodes × 36 cores × (512 bit-wide vector unit)/(64-bit double) = 4,608x potential speedup
Step 6: Off-loading the calculation to GPUs
The GPU is another hardware resource for supercharging parallelization. With GPUs, we can harness lots of streaming multiprocessors for work. For example, figure 1.14 shows how the work can be split up into separate 8×8 tiles. Using the hardware specifications for the NVIDIA Volta GPU, these tiles can be operated on by 32 double-precision cores on each of 84 streaming multiprocessors, giving us a total of 2,688 double-precision cores that work simultaneously. If we have one GPU per node in a 16-node cluster, each with 2,688 double-precision cores, this is a 43,008-way parallelization from 16 GPUs.
Figure 1.14 On a GPU, the vector length is much larger than on a CPU. Here, 8×8 tiles are distributed across GPU work groups.
These are impressive numbers, but at this point, we must temper expectations by acknowledging that actual speedup falls far short of this full potential. Our challenge now becomes organizing such extreme and disparate layers of parallelization to obtain as much speedup as possible.
For this high-level application walk-through, we left out a lot of important details, which we will cover in later chapters. But even this nominal level of detail highlights some of the strategies for exposing parallelization of an algorithm. To be able to develop similar strategies for other problems, an understanding of modern hardware and software is necessary. We now dive deeper into the current hardware and software models. These conceptual models are simplified representations of the diverse real-world hardware to avoid complexity and maintain generality over quickly evolving systems.
To build a basic understanding of how parallel computing works, we’ll explain the components in today’s hardware. To begin, Dynamic Random Access Memory, called DRAM, stores information or data. A computational core, or core for short, performs arithmetic operations (add, subtract, multiply, divide), evaluates logical statements, and loads and stores data from DRAM. When an operation is performed on data, the instructions and data are loaded from memory onto the core, operated on, and stored back into memory. Modern CPUs, often called processors, are outfitted with many cores capable of executing these operations in parallel. It is also becoming common to find systems outfitted with accelerator hardware, like GPUs. GPUs are equipped with thousands of cores and a memory space that is separate from the CPU’s DRAM.
A combination of a processor (or two), DRAM, and an accelerator compose a compute node, which can be referred to in the context of a single home desktop or a “rack” in a supercomputer. Compute nodes can be connected to each other with one or more networks, sometimes called an interconnect. Conceptually, a node runs a single instance of the OS that manages and controls all of the hardware resources. As hardware is becoming more complex and heterogeneous, we’ll start with simplified models of the system’s components so that each is more obvious.
Distributed memory architecture: A cross-node parallel method
One of the first and most scalable approaches to parallel computing is the distributed memory cluster (figure 1.15). Each CPU has its own local memory composed of DRAM and is connected to other CPUs by a communication network. The good scalability of distributed memory clusters arises from their seemingly limitless ability to incorporate more nodes.
Figure 1.15 The distributed memory architecture links nodes composed of separate memory spaces. These nodes can be workstations or racks.
This architecture also provides some memory locality by dividing the total addressable memory into smaller subspaces for each node, which makes accessing memory off-node clearly different than on-node. This forces the programmer to explicitly access different memory regions. The disadvantage of this is that the programmer must manage the partitioning of the memory spaces at the outset of the application.
Shared memory architecture: An on-node parallel method
An alternative approach connects two CPUs directly to the same shared memory (figure 1.16). The strength of this approach is that the processors share the same address space, which simplifies programming. But this introduces potential memory conflicts, resulting in correctness and performance issues. Synchronizing memory access and values between CPUs, or between the processing cores on a multi-core CPU, is complicated and expensive.
Figure 1.16 The shared memory architecture provides parallelization within a node.
The addition of more CPUs and processing cores does not increase the amount of memory available to the application. This and the synchronization costs limit the scalability of the shared memory architecture.
Vector units: Multiple operations with one instruction
Why not just increase the clock frequency for the processor to get greater throughput as done in the past? The biggest limitation in increasing CPU clock frequencies is that it requires more power and produces more heat. Whether it is an HPC supercomputing center with limits on installed power lines or your cell phone with limited battery capacity, devices today all have power limitations. This problem is called the power wall.
Rather than increasing the clock frequency, why not do more than one operation per cycle? This is the idea behind the resurgence of vectorization on many processors. It takes only a little more energy to do multiple operations in a vector unit, compared to a single operation (more formally called a scalar operation). With vectorization, we can process more data in a single clock cycle than with a serial process. There is little change to the power requirements for multiple operations (versus just one), and a reduction in execution time can lead to a decrease in energy consumption for an application. Much like a four-lane freeway that allows four cars to move simultaneously in comparison to a single lane road, the vector operation gives greater processing throughput. Indeed, the four pathways through the vector unit, shown in different shadings in figure 1.17, are commonly called lanes of a vector operation.
Most CPUs and GPUs have some capability for vectorization or equivalent operations. The amount of data processed in one clock cycle, the vector length, depends on the size of the vector units on the processor. Currently, the most commonly available vector length is 256 bits. If the discretized data are 64-bit doubles, then we can do four floating-point operations simultaneously as one vector operation. As figure 1.17 illustrates, vector hardware units load one block of data at a time, perform a single operation on the data simultaneously, and then store the result.
Figure 1.17 Vector processing example with four array elements operated on simultaneously
Accelerator device: A special-purpose add-on processor
An accelerator device is a discrete piece of hardware designed for executing specific tasks at a fast rate. The most common accelerator device is the GPU. When used for computation, this device is sometimes referred to as a general-purpose graphics processing unit (GPGPU). The GPU contains many small processing cores, called streaming multiprocessors (SMs). Although simpler than a CPU core, SMs provide a massive amount of processing power. Usually, you’ll find a small integrated GPU on the CPU.
Most modern computers also have a separate, discrete GPU connected to the CPU by the Peripheral Component Interconnect (PCI) bus (figure 1.18). This bus introduces a communication cost for data and instructions, but the discrete card is often more powerful than an integrated unit. In high-end systems, for example, NVIDIA uses NVLink and AMD Radeon uses their Infinity Fabric to reduce data communication costs, but this cost is still substantial. We will discuss the interesting GPU architecture more in chapters 9-12.
Figure 1.18 GPUs come in two varieties: integrated and discrete. Discrete or dedicated GPUs typically have a large number of streaming multiprocessors and their own DRAM. Accessing data on a discrete GPU requires communication over a PCI bus.
General heterogeneous parallel architecture model
Now let’s combine all of these different hardware architectures into one model (figure 1.19). There are two nodes, each with two CPUs that share the same DRAM memory. Each CPU is a dual-core processor with an integrated GPU. A discrete GPU on the PCI bus also attaches to one of the CPUs. Though the CPUs share main memory, they are commonly in different Non-Uniform Memory Access (NUMA) regions. This means that accessing the second CPU’s memory is more expensive for a CPU than accessing its own memory.
Figure 1.19 A general heterogeneous parallel architecture model consisting of two nodes connected by a network. Each node has a multi-core CPU with an integrated and discrete GPU and some memory (DRAM). Modern compute hardware normally has some arrangement of these components.
Throughout this hardware discussion, we have presented a simplified model of the memory hierarchy, showing just DRAM or main memory. We’ve shown a cache in the combined model (figure 1.19), but no detail on its composition or how it functions. We reserve our discussion of the complexities of memory management, including multiple levels of cache, for chapter 3. In this section, we simply presented a model for today’s hardware to help you identify the available components so that you can select the parallel strategy best suited for your application and hardware choices.
The software model for parallel computing is necessarily motivated by the underlying hardware but is nonetheless distinct from it. The OS provides the interface between the two. Parallel operations do not spring to life on their own; rather, the source code must indicate how to parallelize work by spawning processes or threads; offloading data, work, and instructions to a compute device; or operating on blocks of data at a time. The programmer must first expose the parallelization, determine the best technique to operate in parallel, and then explicitly direct its operation in a safe, correct, and efficient manner. The following are the most common techniques for parallelization; we’ll go through each of these in detail:
Process-based parallelization: Message passing
The message passing approach was developed for distributed memory architectures; it uses explicit messages to move data between processes. In this model, your application spawns separate processes, called ranks in message passing, each with its own memory space and instruction pipeline (figure 1.20). The figure also shows that the processes are handed to the OS for placement on the processors. The application lives in the part of the diagram marked as user space, where the user has permission to operate. The part beneath is kernel space, which is protected from dangerous operations by the user.
Figure 1.20 The message passing library spawns processes. The OS places the processes on the cores of two nodes. The question marks indicate that the OS controls the placement of the processes and can move these during run time as indicated by the dashed arrows. The OS also allocates memory for each process from the node’s main memory.
Keep in mind that processors (CPUs) have multiple processing cores, and that cores are not equivalent to processes. Processes are an OS concept, and processing cores are a hardware component. However many processes the application spawns, the OS schedules them onto the processing cores. You can actually run eight processes on your quad-core laptop; these will just swap in and out of the processing cores. For this reason, mechanisms have been developed to tell the OS how to place processes and whether to “bind” a process to a processing core. Controlling binding is discussed in more detail in chapter 14.
To move data between processes, you’ll need to program explicit messages into the application. These messages can be sent over a network or via shared memory. The many message-passing libraries coalesced into the Message Passing Interface (MPI) standard in 1992. Since then, MPI has taken over this niche and is present in almost all parallel applications that scale beyond a single node. And, yes, you’ll also find many different implementations of MPI libraries as well.
Thread-based parallelization: Shared data via memory
The thread-based approach to parallelization spawns separate instruction pointers within the same process (figure 1.21). As a result, you can easily share portions of the process memory between threads. But this comes with correctness and performance pitfalls. The programmer is left to determine which sections of the instruction set and data are independent and can support threading. These considerations are discussed in more detail in chapter 7, where we will look at OpenMP, one of the leading threading systems. OpenMP provides the capability to spawn threads and divide up the work among the threads.
Figure 1.21 The application process in a thread-based approach to parallelization spawns threads. The threads are restricted to the node’s domain. The question marks show that the OS decides where to place the threads. Some memory is shared between threads.
There are many varieties of threading approaches, ranging from heavy to light-weight and managed by either the user space or the OS. While threading systems are limited to scaling within a single node, these are an attractive option for modest speedup. The memory limitations of the single node, however, have larger implications for the application.
Vectorization: Multiple operations with one instruction
Vectorizing an application can be far more cost-effective than expanding compute resources at an HPC center, and this method might be absolutely necessary on portable devices like cell phones. When vectorizing, work is done in blocks of 2-16 data items at a time. The more formal term for this operation classification is single instruction, multiple data (SIMD). The term SIMD is used a lot when talking about vectorization. SIMD is just one category of parallel architectures that will be discussed later in section 1.4.
Invoking vectorization from a user’s application is most often done through source code pragmas or through compiler analysis. Pragmas and directives are hints given to the compiler to guide how to parallelize or vectorize a section of code. Both pragmas and compiler analysis are highly dependent on the compiler’s capabilities (figure 1.22). Here we depend on the compiler, whereas the previous parallel mechanisms depended on the OS. Also, without explicit compiler flags, the generated code targets the least powerful processor and vector length, significantly reducing the effectiveness of the vectorization. There are mechanisms to bypass the compiler, but these require much more programming effort and are not portable.
Figure 1.22 Vector instructions in source code returning different performance levels from compilers
Stream processing through specialized processors
Stream processing is a dataflow concept, where a stream of data is processed by a simpler special-purpose processor. Long used in embedded computing, the technique was adapted for rendering large sets of geometric objects for computer displays on a specialized processor, the GPU. These GPUs are filled with a broad set of arithmetic operations and multiple streaming multiprocessors (SMs) to process geometric data in parallel. Scientific programmers soon found ways to adapt stream processing to large sets of simulation data, such as cells, expanding the role of the GPU to a general-purpose GPU (GPGPU).
In figure 1.23, the data and kernel are shown offloaded over the PCI bus to the GPU for computation. GPUs are still limited in functionality in comparison to CPUs, but where the specialized functionality can be used, these provide extraordinary compute capability at a lower power requirement. Other specialized processors also fit this category, though we focus on the GPU for our discussions.
Figure 1.23 In the stream processing approach, data and compute kernel are offloaded to the GPU and its streaming multiprocessors. Processed data, or output, transfers back to the CPU for file IO or other work.
If you read more about parallel computing, you will encounter acronyms such as SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data). These terms refer to categories of computer architectures proposed by Michael Flynn in 1966 in what has become known as Flynn’s Taxonomy. These classes help to view potential parallelization in architectures in different ways. The categorization is based on breaking up instructions and data into either serial or multiple operations (figure 1.24). Be aware that though the taxonomy is useful, some architectures and algorithms do not fit neatly within a category. The usefulness comes from recognizing patterns in categories such as SIMD that have potential difficulties with conditionals. This is because each data item might want to be in a different block of code, but the threads have to execute the same instruction.
Figure 1.24 Flynn’s Taxonomy categorizes different parallel architectures. A serial architecture is single data, single instruction (SISD). Two categories only have partial parallelization in that either the instructions or data are parallel, but the other is serial.
In the case where there is more than one instruction sequence, the category is called multiple instruction, single data (MISD). This is not a common architecture; the best example is a redundant computation on the same data. This is used in highly fault-tolerant approaches such as spacecraft controllers. Because spacecraft are in high radiation environments, these often run two copies of each calculation and compare the output of the two.
Vectorization is a prime example of SIMD in which the same instruction is performed across multiple data. A variant of SIMD is single instruction, multi-thread (SIMT), which is commonly used to describe GPU work groups.
The final category has parallelization in both instructions and data and is referred to as MIMD. This category describes multi-core parallel architectures that comprise the majority of large parallel systems.
So far in our initial example in section 1.3.1, we looked at data parallelization for cells or pixels. But data parallelization can also be used for particles and other data objects. Data parallelization is the most common approach and often the simplest. Essentially, each process executes the same program but operates on a unique subset of data as illustrated in the upper right of figure 1.25. The data parallel approach has the advantage that it scales well as the problem size and number of processors grow.
Figure 1.25 Various task and data parallel strategies, including main-worker, pipeline or bucket-brigade and data parallelism
Another approach is task parallelism. This includes the main controller with worker threads, pipeline, or bucket-brigade strategies, also shown in figure 1.25. The pipeline approach is used in superscalar processors, where address and integer calculations are done with a separate logic unit rather than the floating-point processor, allowing these calculations to be done in parallel. The bucket-brigade uses each processor to operate on and transform the data in a sequence of operations. In the main-worker approach, one processor schedules and distributes the tasks for all the workers, and each worker checks for the next work item as it returns the previously completed task. It is also possible to combine different parallel strategies to expose a greater degree of parallelism.
We will present a lot of comparative performance numbers and speedups throughout this book. Often the term speedup is used to compare two different run times with little explanation or context for fully understanding what it means. Speedup is a general term used in many contexts, such as quantifying the effects of an optimization. To clarify the difference between the two major categories of parallel performance numbers, we’ll define two different terms.
Parallel speedup—We should really call this serial-to-parallel speedup. The speedup is relative to a baseline serial run on a standard platform, usually a single CPU. The parallel speedup can be due to running on a GPU or with OpenMP or MPI on all the cores on the node of a computer system.
Comparative speedup—We should really call this comparative speedup between architectures. This is usually a performance comparison between two parallel implementations or other comparison between reasonably constrained sets of hardware. For example, it may be between a parallel MPI implementation on all the cores of the node of a computer versus the GPU(s) on a node.
These two categories of performance comparisons represent two different goals. The first is to understand how much speedup can be obtained through adding a particular type of parallelism. It is not a fair comparison between architectures, however. It is about parallel speedup. For example, comparing a GPU run time to a serial CPU run is not a fair comparison between a multi-core CPU and the GPU. Comparative speedups between architectures are more appropriate when trying to compare a multi-core CPU to the performance of one or more GPUs on a node.
In recent years, some have normalized the two architectures so that relative performance is compared for similar power or energy requirements rather than an arbitrary node. Still, there are so many different architectures and possible combinations that performance numbers can be obtained to justify almost any conclusion. You can pick a fast GPU and a slow CPU, or a quad-core CPU versus a 16-core processor. We therefore suggest that you add the following terms in parentheses to performance comparisons to help give them more context:
Add (Best 2016) to each term. For example, parallel speedup (Best 2016) and comparative speedup (Best 2016) would indicate that the comparison is between the best hardware released in a particular year (2016 in this example), where you might compare a high-end GPU to a high-end CPU.
Add (Common 2016) or (2016) if the two architectures were released in 2016 but are not the highest-end hardware. This might be relevant to developers and users who have more mainstream parts than that found in the top-end systems.
Add (Mac 2016) if the GPU and the CPU were released in a 2016 Mac laptop or desktop, or something similar for other brands with fixed components over a period of time (2016 in this example). Performance comparisons of this type are valuable to users of a commonly available system.
Add (GPU 2016:CPU 2013) to show that there is a possible mismatch in the hardware release year (2016 versus 2013 in this example) of the components being compared.
No qualifications added to comparison numbers. Who knows what the numbers mean?
Because of the explosion in CPU and GPU models, performance numbers will necessarily be more of a comparison between apples and oranges rather than a well-defined metric. But for more formal settings, we should at least indicate the nature of the comparison so that others have a better idea of the meaning of the numbers and to be more fair to the hardware vendors.
This book is written with the application code developer in mind and no previous knowledge of parallel computing is assumed. You should simply have a desire to improve the performance and scalability of your application. The application areas include scientific computing, machine learning, and analysis of big data on systems ranging from a desktop to the largest supercomputers.
To fully benefit from this book, readers should be proficient programmers, preferably with a compiled, HPC language such as C, C++, or Fortran. We also assume a rudimentary knowledge of hardware architectures. In addition, readers should be comfortable with computer technology terms such as bits, bytes, ops, cache, RAM, etc. It is also helpful to have a basic understanding of the functions of an OS and how it manages and interfaces with the hardware components. After reading this book, some of the skills you will gain include
Determining when message passing (MPI) is more appropriate than threading (OpenMP) and vice-versa
Discerning which sections of your application have the most potential for speedup
Deciding when it might be beneficial to leverage a GPU to accelerate your application
Establishing what is the peak potential performance for your application
Even after this first chapter, you should feel comfortable with the different approaches to parallel programming. We suggest that you work through the exercises in each chapter to help you integrate the many concepts that we present. If you are beginning to feel a little overwhelmed by the complexity of the current parallel architectures, you are not alone. It’s challenging to grasp all the possibilities. We’ll break it down, piece-by-piece, in the following chapters to make it easier for you.
A good basic introduction to parallel computing can be found on the Lawrence Livermore National Laboratory website:
Blaise Barney, “Introduction to Parallel Computing.” https://computing.llnl.gov/tutorials/parallel_comp/.
What are some other examples of parallel operations in your daily life? How would you classify your example? What does the parallel design appear to optimize for? Can you compute a parallel speedup for this example?
For your desktop, laptop, or cell phone, what is the theoretical parallel processing power of your system in comparison to its serial processing power? What kinds of parallel hardware are present in it?
Which parallel strategies do you see in the store checkout example in figure 1.1? Are there some present parallel strategies that are not shown? How about in your examples from exercise 1?
You have an image-processing application that needs to process 1,000 images daily; each image is 4 mebibytes (MiB, 2^20 or 1,048,576 bytes) in size. It takes 10 min in serial to process each image. Your cluster is composed of multi-core nodes with 16 cores and a total of 16 gibibytes (GiB, 2^30 bytes, or 1,024 mebibytes) of main memory per node. (Note that we use the proper binary terms, MiB and GiB, rather than MB and GB, which are the metric terms for 10^6 and 10^9 bytes, respectively.)
An Intel Xeon E5-4660 processor has a thermal design power of 130 W; this is the average power consumption rate when all 16 cores are used. NVIDIA’s Tesla V100 GPU and AMD’s MI25 Radeon GPU have a thermal design power of 300 W. Suppose you port your software to use one of these GPUs. How much faster should your application run on the GPU to be considered more energy efficient than your 16-core CPU application?
Because this is an era where most of the compute capabilities of hardware are only accessible through parallelism, programmers should be well versed in the techniques used to exploit parallelism.
Applications must have parallel work. The most important job of a parallel programmer is to expose more parallelism.
Improvements to hardware are nearly all enhancements to parallel components. Relying on increasing serial performance will not result in future speedups. The key to increasing application performance lies in the parallel realm.
A variety of parallel software languages are emerging to help access the hardware capabilities. Programmers should know which are suitable for different situations.
Developing a parallel application or making an existing application run in parallel can feel challenging at first. Often, developers new to parallelism are unsure of where to begin and what pitfalls they might encounter. This chapter focuses on a workflow model for developing parallel applications as illustrated in figure 2.1. This model provides the context for where to get started and how to maintain progress in developing your parallel application. Generally, it is best to implement parallelism in small increments so that if problems are encountered, the last few commits can be reversed. This kind of pattern is suited to agile project management techniques.
Figure 2.1 Our suggested parallel development workflow begins with preparing the application and then repeating four steps to incrementally parallelize an application. This workflow is particularly suited to an agile project management technique.
Let’s imagine that you have been assigned a new project to speed up and parallelize an application from the spatial mesh presented in figure 1.9 (the Krakatau volcano example). This could be an image detection algorithm, a scientific simulation of the ash plume, or a model of the resulting tsunami waves, or all three of these. What steps can you take to have a successful parallelism project?
It is tempting to just jump into the project. But without thought and preparation, you greatly reduce your chance of success. As a start, you will need a project plan for this parallelism effort, so we begin here with a high-level overview of the steps in this workflow. Then we’ll dive deeper into each step as this chapter progresses, with a focus on the characteristics typical for a parallel project.
Figure 2.2 presents the recommended components in the preparation step. These are the items proven to be important specifically for parallelization projects.
Figure 2.2 The recommended preparation components address issues that are important for parallel code development.
At this stage, you will need to set up version control, develop a test suite for your application, and clean up existing code. Version control allows you to track the changes you make to your application over time. It permits you to quickly undo mistakes and track down bugs in your code at a later date. A test suite allows you to verify the correctness of your application with each change that is made to your code. When coupled with version control, this can be a powerful setup for rapidly developing your application.
With version control and code testing in place, you can now tackle the task of cleaning up your code. Good code is easy to modify and extend, and does not exhibit unpredictable behavior. Good, clean code can be ensured with modularity and checks for memory issues. Modularity means that you implement kernels as independent subroutines or functions with well-defined input and output. Memory issues can include memory leaks, out-of-bounds memory access, and use of uninitialized memory. Starting your parallelism work with predictable and quality code promotes rapid progress and predictable development cycles. It is hard to match your serial code if the original results are due to a programming error.
Finally, you will want to make sure your code is portable. This means that multiple compilers can compile your code. Having and maintaining compiler portability allows your application to target additional platforms beyond the one you may currently have in mind. Further, experience shows that developing code to work with multiple compilers helps to find bugs before they are committed to your code’s version history. With the high-performance computing landscape changing rapidly, portability allows you to adapt to changes much more quickly down the line.
It is not unusual that the preparation time rivals that spent on the actual parallelism, especially for complex code. Including this preparation in your project scope and time estimates avoids frustrations with your project’s progress. In this chapter, we assume that you are starting from a serial or prototype application. However, you can still benefit from this workflow strategy even if you’ve already started parallelizing your code. Next, we discuss the four components of project preparation.
It is inevitable with the many changes that occur during parallelism that you will suddenly find the code is broken or returning different results. Being able to recover from this situation by backing up to a working version is critically important.
Note Check to see what kind of version control is in place for your application before beginning any parallelism work.
For your image detection project in our scenario, you find that there is already a version control system in place. But the ash plume model never had any version control. As you dig deeper, you find that there are actually four versions of the ash plume code in various developers’ directories. When there is a version control system in operation, you may want to review the processes your team uses for day-to-day operations. Perhaps the team thinks it is a good idea to switch to a “pull request” model, where changes are posted for review by other team members before being committed. Or you and your team may feel that the direct commit of the “push” model is more compatible with the rapid, small commits of parallelism tasks. In the push model, commits are made directly to the repository without review. In our example of the ash plume application without version control, the priority is to get something in place to tame the uncontrolled divergence of code among developers.
There are many options for version control. If you have no other preferences, we would suggest Git, the most common distributed version control system. A distributed version control system is one that allows multiple repository databases, rather than a single centralized system used in centralized version control. Distributed version control is advantageous for open source projects and where developers work on laptops, in remote locations, or other situations where they are not connected to a network or close to the central repository. In today’s development environment, this is a huge advantage. But it comes with the cost of additional complexity. Centralized version control is still popular and more appropriate for the corporate environment because there is only one place where all the information about the source code exists. Centralized control also provides better security and protection for proprietary software.
There are many good books, blogs, and other resources on how to use Git; we list a few at the end of the chapter. We also list some other common version control systems in chapter 17. These include free distributed version control systems such as Mercurial and Git; commercial systems such as Perforce and ClearCase; and, for centralized version control, CVS and SVN. Regardless of which system you use, you and your team should commit frequently. The following scenario is especially common with parallelism tasks:
This happens to me far too often. So I try to avoid the problem by committing regularly.
Tip If you do not want lots of small commits in the main repository, you can collapse the commits with some version control systems such as Git, or you can maintain a temporary version control system just for yourself.
The commit message is where the commit author can communicate what task is being addressed and why certain changes were made, whether for self or for current or future team members. Every team has their own preference for how detailed these messages should be; we recommend using as much detail as possible in your commit messages. This is your opportunity to save yourself from later confusion by being diligent today.
In general, commit messages include a summary and a body. The summary provides a short statement indicating clearly what new changes the commit covers. Additionally, if you use an issue tracking system, the summary line will reference an issue number from that system. Finally, the body contains most of the “why” and “how” behind the commit.
With a plan for version control and at least a rough agreement on your team’s development processes, we are ready to move on to the next step.
A test suite is a set of problems that exercise parts of an application to guarantee that related parts of the code still work. Test suites are a necessity for all but the simplest of codes. With each change, you should test to see that the results that you get are the same. This sounds simple, but some code can reach slightly different results with different compilers and numbers of processors.
In the following sections, we’ll discuss why such differences can arise, how to determine which variations are reasonable, and how to design tests that catch real bugs before these are committed to your repository.
Understanding changes in results due to parallelism
The parallelism process inherently changes the order of operations, which slightly modifies the numerical results. But errors in parallelism also generate small differences. This is crucial to understand in parallel code development because we need to compare against a single-processor run to determine whether our parallel coding is correct. In section 5.7, where we discuss techniques for global sums, we’ll cover a way to reduce the numerical errors so that the parallelism errors are more obvious.
For our test suite, we will need a tool that compares numerical fields with a small tolerance for differences. In the past, test suite developers would have to create a tool for this purpose, but a few numerical diff utilities have appeared on the market in recent years. Two such tools are
Numdiff from https://www.nongnu.org/numdiff/
Ndiff from https://www.math.utah.edu/~beebe/software/ndiff/
Alternatively, if your code outputs its state in HDF5 or NetCDF files, these formats come with utilities that allow you to compare values stored in the files with varying tolerances.
HDF5® is version 5 of the software originally known as Hierarchical Data Format, now called HDF. It is freely available from The HDF Group (https://www.hdfgroup.org/) and is a common format used to output large data files.
NetCDF or the Network Common Data Form is an alternate format used by the climate and geosciences community. Current versions of NetCDF are built on top of HDF5. You can find these libraries and data formats at the Unidata Program Center’s website (https://www.unidata.ucar.edu/software/netcdf/).
Both of these file formats use binary data for speed and efficiency. Binary data is the machine representation of the data. This format just looks like gibberish to you and me, but HDF5 has some useful utilities that allow us to look at what’s inside. The h5ls utility lists the objects in the file, such as the names of all the data arrays. The h5dump utility dumps the data in each object or array. And most importantly for our purposes here, the h5diff utility compares two HDF files and reports the difference above a numeric tolerance. HDF5 and NetCDF along with other parallel input/output (I/O) topics will be discussed in more detail in chapter 16.
Using CMake and CTest to automatically test your code
Many testing systems have become available in recent years. These include CTest, Google Test, pFUnit, and others. You can find more information on these tools in chapter 17. For now, let’s look at a system created using CTest and ndiff.
CTest is a component of the CMake system. CMake is a configuration system that adapts generated makefiles to different systems and compilers. Incorporating the CTest testing system into CMake couples the two tightly together into a unified system. This provides a lot of convenience to the developer. The process of implementing tests using CTest is relatively easy. The individual tests are written as any sequence of commands. To incorporate these into the CMake system requires adding the following to the CMakeLists.txt:
add_test(<testname> <executable name> <arguments to executable>)
Then you can invoke the tests with make test or ctest, or you can select individual tests with ctest -R mpi, where mpi is a regular expression that runs any tests with matching names. Let’s walk through an example of creating a test using the CTest system.
Make two source files as shown in listing 2.1 to create applications for this simple testing system. We’ll use a timer to produce small differences in output from both a serial and a parallel program. Note that you’ll find the source code for this chapter at https://github.com/EssentialsofParallelComputing/Chapter2.
Listing 2.1 Simple timing programs for demonstrating the testing system
C Program, TimeIt.c
1 #include <unistd.h>
2 #include <stdio.h>
3 #include <time.h>
4 int main(int argc, char *argv[]){
5 struct timespec tstart, tstop, tresult;
6 clock_gettime(CLOCK_MONOTONIC, &tstart); ❶
7 sleep(10); ❶
8 clock_gettime(CLOCK_MONOTONIC, &tstop); ❶
 9   tresult.tv_sec = tstop.tv_sec - tstart.tv_sec;     ❷
10   tresult.tv_nsec = tstop.tv_nsec - tstart.tv_nsec;  ❷
11   printf("Elapsed time is %f secs\n",                ❸
12      (double)tresult.tv_sec + (double)tresult.tv_nsec*1.0e-9);  ❸
13 }
MPI Program, MPITimeIt.c
1 #include <unistd.h>
2 #include <stdio.h>
3 #include <mpi.h>
4 int main(int argc, char *argv[]){
5 int mype;
6 MPI_Init(&argc, &argv); ❹
7 MPI_Comm_rank(MPI_COMM_WORLD, &mype); ❹
8 double t1, t2;
9 t1 = MPI_Wtime(); ❺
10 sleep(10); ❺
11 t2 = MPI_Wtime(); ❺
12   if (mype == 0)
        printf("Elapsed time is %f secs\n", t2 - t1);   ❻
13   MPI_Finalize();                                    ❼
14 }
❶ Starts timer, calls sleep, then stops the timer
❷ Timer has two values for resolution and to prevent overflows.
❸ Prints the elapsed time in seconds
❹ Initializes MPI and gets processor rank
❺ Starts timer, calls sleep, then stops the timer
❻ Prints timing output from first processor
❼ Shuts down MPI
Now you need a test script that runs the applications and produces a few different output files. After these run, there should be numerical comparisons of the output. Here is an example of the process you can put in a file called mympiapp.ctest. You should do a chmod +x to make it executable.
mympiapp.ctest

1 #!/bin/sh
2 ./TimeIt > run0.out                              ❶
3 mpirun -n 1 ./MPITimeIt > run1.out               ❷
4 mpirun -n 2 ./MPITimeIt > run2.out               ❸
5 ndiff --relative-error 1.0e-4 run1.out run2.out  ❹
6 test1=$?                                         ❺
7 ndiff --relative-error 1.0e-4 run0.out run2.out  ❻
8 test2=$?                                         ❻
9 exit "$(($test1+$test2))"                        ❼
❶ Runs the serial program to produce a baseline output
❷ Runs the first MPI test on 1 processor
❸ Runs the second MPI test on 2 processors
❹ Compares the output for the two MPI jobs
❺ Captures the status set by the ndiff command
❻ Compares the serial output to the 2 processor run
❼ Exits with the cumulative status code so CTest can report pass or fail
This test first compares the output for a parallel job with 1 and 2 processors with a tolerance of 0.1% on line 5. Then it compares the serial run to the 2 processor parallel job on line 7. To get the tests to fail, try reducing the tolerance to 1.0e-5. CTest uses the exit code on line 9 to report pass or fail. The simplest way to add a bunch of CTest files to the test suite is to use a loop that finds all the files ending in .ctest and adds these to the CTest list. Here is an example of a CMakeLists.txt file with the additional instructions to create the two applications:
CMakeLists.txt

 1 cmake_minimum_required (VERSION 3.0)
 2 project (TimeIt)
 3
 4 enable_testing()                                      ❶
 5
 6 find_package(MPI)                                     ❷
 7
 8 add_executable(TimeIt TimeIt.c)                       ❸
 9
10 add_executable(MPITimeIt MPITimeIt.c)                 ❸
11 target_include_directories(MPITimeIt PUBLIC           ❹
      ${MPI_INCLUDE_PATH})                               ❹
12 target_link_libraries(MPITimeIt ${MPI_LIBRARIES})     ❹
13
14 file(GLOB TESTFILES RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "*.ctest")  ❺
15 foreach(TESTFILE ${TESTFILES})                        ❺
16    add_test(NAME ${TESTFILE} WORKING_DIRECTORY ${CMAKE_BINARY_DIR}     ❺
17             COMMAND sh ${CMAKE_CURRENT_SOURCE_DIR}/${TESTFILE})        ❺
18 endforeach()                                          ❺
19
20 add_custom_target(distclean COMMAND rm -rf CMakeCache.txt CMakeFiles   ❻
21    CTestTestfile.cmake Makefile Testing cmake_install.cmake)           ❻
❶ Enables CTest functionality in CMake
❷ CMake built-in routine to find most MPI packages
❸ Adds TimeIt and MPITimeIt build targets with their source code files
❹ Needs an include path to the mpi.h file and to the MPI library
❺ Gets all files with the extension .ctest and adds those to the test list for CTest
❻ A custom command, distclean, removes created files.
The find_package(MPI) command on line 6 defines MPI_FOUND, MPI_INCLUDE_PATH, and MPI_LIBRARIES. In newer CMake versions, these variables include the language, as MPI_<lang>_INCLUDE_PATH and MPI_<lang>_LIBRARIES, so that there are different paths for C, C++, and Fortran. Now all that remains is to run the tests with
mkdir build && cd build
cmake ..
make
make test

or

ctest
You can also get the output for failed tests with
ctest --output-on-failure
You should get some results like the following:
Running tests...
Test project /Users/brobey/Programs/RunDiff
    Start 1: mpitest.ctest
1/1 Test #1: mpitest.ctest .................... Passed   30.24 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =  30.24 sec
This test is based on the sleep function and timers, so it may or may not pass. Test results are in Testing/Temporary/*.
In this test, we compared the output between individual runs of the application. It is also good practice to store a gold standard file from one of the runs along with the test script to compare against as well. This comparison detects changes that will cause a new version of the application to get different results than earlier versions. When this happens, it is a red flag; check if the new version is still correct. If so, you should update the gold standard.
Your test suite should exercise as many parts of the code as is practical. The code coverage metric quantifies how well the test suite does its task; it is expressed as a percentage of the lines of source code exercised. There is an old saying among test developers that the part of the code that doesn’t have a test is broken, because even if it isn’t now, it will be eventually. With all of the changes made when parallelizing code, breakage is inevitable. While high code coverage is important, for our parallelism efforts it is more critical that there are tests for the parts of the code you are parallelizing. Many compilers have the capability to generate code coverage statistics. For GCC, gcov is the profiling tool, and for Intel, it is Codecov. We’ll take a look at how this works for GCC.
Understanding the different kinds of code tests
There are also different kinds of testing systems. In this section, we’ll cover the following types:
Regression tests—Run at regular intervals to keep the code from backsliding. This is typically done nightly or weekly using the cron job scheduler that launches jobs at specified times.
Unit tests—Tests the operation of subroutines or other small parts of code during development.
Continuous integration tests—Gaining in popularity, these tests are automatically triggered to run by a commit to the code.
Commit tests—A small set of tests that can be run from the command line in a fairly short time and are used before commits.
All of these testing types are important for a project and, rather than just relying on one, these should be used together as figure 2.3 illustrates. Testing is particularly important for parallel applications because detecting bugs earlier in the development cycle means that you are not debugging 1,000 processors 6 hours into a run.
Figure 2.3 The different test types address different parts of code development to create a high-quality code that is always ready to release.
Unit tests are best created as you develop the code. True aficionados of unit tests use test-driven development (TDD), where the tests are created first and then the code is written to pass them. Incorporating these types of tests into parallel code development includes testing their operation in the parallel language and implementation. Problems identified at this level are far easier to resolve.
Commit tests are the first tests that you should add to a project; they are heavily used in the code modification phase. These tests should exercise all of the routines in the code. By having these tests readily available, team members can run them before making a commit to the repository. We recommend that developers invoke these tests from the command line, for example via a Bash or Python script or a makefile target, prior to a commit.
The commit tests can be run with ctest -R commit or, with a custom target added to the CMakeLists.txt, with make commit_tests. A make test or ctest command runs all the tests, including the longer ones, which takes a while. The commit test command picks out the tests with commit in their names to get a set that covers critical functionality but runs a little faster. Now the workflow is
And repeat. Continuous integration tests are invoked by a commit to the main code repository. This is an additional guard against committing bad code. The tests can be the same as the commit tests or can be more extensive. Top continuous integration tools for these types of tests are
Jenkins (https://www.jenkins.io)
Travis CI for GitHub and Bitbucket (https://travis-ci.com)
GitLab CI (https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/)
CircleCI (https://circleci.com)
Regression tests are usually set up to run overnight through a cron job. This means that the test suites can be more extensive than those for the other testing types. These tests can be longer but should complete by the morning report. Additional tests, such as memory checks and code coverage, are often run as regression tests due to the longer run times and the periodicity of the reports. The results of regression tests are often tracked over time, and a “wall of passes” is considered an indication of the project’s well-being.
Further requirements of an ideal testing system
While the testing system as described previously is sufficient for most purposes, there is more that can be helpful for larger HPC projects. These types of HPC projects can have extensive test suites and might also need to be run in a batch system to access larger resources.
The Collaborative Testing System (CTS) at https://sourceforge.net/projects/ctsproject/ provides an example of a system that was developed for these demands. It uses a Perl script to run a fixed set of test servers, typically 10, launching the tests in parallel to a batch system. As each test completes, it launches the next. This avoids flooding the system with jobs all at once. The CTS system also autodetects the batch system and type of MPI and adjusts the scripts for each system. The reporting system uses cron jobs, with the tests launched early in the overnight period. The cross-platform report launches in the morning and is then sent out.
Good code quality is paramount. Parallelizing often causes any code flaw to appear; this might be uninitialized memory or memory overwrites.
Uninitialized memory is memory that is accessed before its values are set. When you allocate memory to your program, it gets whatever values are in those memory locations. This leads to unpredictable behavior if it is used before being set.
Memory overwrites occur when data is written to a memory location that isn’t owned by a variable. An example of this is writing past the bounds of an array or string.
To catch these sorts of problems, we suggest using memory correctness tools to thoroughly check your code. One of the best of these is the freely available Valgrind program. Valgrind is an instrumentation framework that operates at the machine-code level by executing instructions through a synthetic CPU. Many tools have been developed under the Valgrind umbrella. The first step is to install Valgrind on your system using a package manager. If you are running the latest version of macOS, you may find that it takes a few months for Valgrind to be ported to the new kernel. Your best bet in that case is to run Valgrind on a different computer or an older macOS, or to spin up a virtual machine or Docker image.
To run Valgrind, execute your program as usual, inserting the valgrind command at the front. For MPI jobs, the valgrind command is placed after mpirun and before your executable name. Valgrind works best with the GCC compiler because that development team adopted it, working to eliminate false positives that can clutter the diagnostic output. When using Intel compilers, we suggest compiling without vectorization to avoid warnings about the vector instructions. You can also try the other memory correctness tools listed in section 17.5.
Using valgrind Memcheck to find memory issues
The Memcheck tool is the default tool in the Valgrind tool suite. It intercepts every instruction and checks it for various types of memory errors, generating diagnostics at the start, during, and at the end of the run. This slows down the run by an order of magnitude. If you have not used it before, be prepared for a lot of output. One memory error leads to many others. The best strategy is to start with the first error, fix it, and run again. To see how Valgrind works, try the example code in listing 2.2. To execute Valgrind, insert the valgrind command before the executable name either as
valgrind <./my_app>

or as

mpirun -n 2 valgrind <./my_app>
Listing 2.2 Example code for Valgrind memory errors
1 #include <stdlib.h>
2
3 int main(int argc, char *argv[]){
4 int ipos, ival; ❶
5 int *iarray = (int *) malloc(10*sizeof(int));
6 if (argc == 2) ival = atoi(argv[1]);
7 for (int i = 0; i<=10; i++){ iarray[i] = ipos; } ❷
8 for (int i = 0; i<=10; i++){
9 if (ival == iarray[i]) ipos = i; ❸
10 }
11 }
❶ Declares ipos and ival without initializing them
❷ Loads uninitialized memory from ipos into iarray
❸ Makes a decision based on uninitialized values
Compile this code with gcc -g -o test test.c and then run it with valgrind --leak-check=full ./test 2. The output from Valgrind is interspersed within the program’s output and can be identified by the prefix with double equal signs (==). The following shows some of the more important parts of the output from this example:
==14324== Invalid write of size 4
==14324==    at 0x400590: main (test.c:7)
==14324==
==14324== Conditional jump or move depends on uninitialized value(s)
==14324==    at 0x4005BE: main (test.c:9)
==14324==
==14324== Invalid read of size 4
==14324==    at 0x4005B9: main (test.c:9)
==14324==
==14324== 40 bytes in 1 blocks are definitely lost in loss record 1 of 1
==14324==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==14324==    by 0x40054F: main (test.c:5)
This output displays reports on several memory errors. The trickiest one to understand is the uninitialized memory report. Valgrind reports the error on line 9, where a decision was made with the uninitialized value. The error actually originates on line 7, where iarray is set to ipos, which was never given a value. In a more complex program, it can take some careful analysis to determine the source of the error; Valgrind’s --track-origins=yes option can help by reporting where each uninitialized value was created.
A last code preparation requirement improves code portability to a wider range of compilers and operating systems. Portability begins with the base HPC language, generally C, C++, or Fortran. Each of these languages maintains standards for compiler implementations, and new standard releases occur periodically. But this does not mean that compilers implement these readily. Often the lag time from release to full implementation by compiler vendors can be long. For example, the Polyhedron Solutions website (http://mng.bz/yYne) reports that no Linux Fortran compiler fully implements the 2008 standard, and less than half fully implement the 2003 standard. Of course, what matters is whether the compilers have implemented the features that you want. C and C++ compilers are usually more up-to-date in their implementations of new standards, but the lag time can still cause problems for aggressive development teams. Also, even if the features are implemented, it does not mean these work in a wide variety of settings.
Compiling with a variety of compilers helps to detect coding errors or identify where code is pushing the “edge” of language interpretations. Portability provides flexibility when using tools that work best in a particular environment. For example, Valgrind works best with GCC, but Intel® Inspector, a thread correctness tool, works best when you compile the application with Intel compilers. Portability also helps when using parallel languages. For example, CUDA Fortran is only available with the PGI compiler. The current implementations of the GPU directive-based languages, OpenACC and OpenMP (with the target directive), are only available on a small set of compilers. Fortunately, MPI and OpenMP for CPUs are widely available for many compilers and systems. At this point, we need to make it clear that there are three distinct OpenMP capabilities: 1) vectorization through SIMD directives, 2) CPU threading from the original OpenMP model, and 3) offloading to an accelerator, generally a GPU, through the new target directives.
Profiling (figure 2.4) determines the hardware performance capabilities and compares them with your application performance. The difference between the capabilities and current performance yields the potential for performance improvement.
Figure 2.4 The purpose of the profiling step is to identify the most important parts of the application code that need to be addressed.
The first part of the profiling process is to determine the limiting aspect of your application’s performance. We’ll detail possible performance limitations for applications in section 3.1. Briefly, most applications today are limited by memory bandwidth or by a limitation that closely tracks memory bandwidth. A few applications might be limited by available floating-point operations (flops). We’ll present ways to calculate theoretical performance limits in section 3.2. We’ll also describe benchmark programs that can measure the achievable performance for that hardware limitation.
Once the potential performance is understood, you can profile your application. We’ll present the process of using some profiling tools in section 3.3. The gap between the current performance of your application and the hardware capabilities for its limiting aspect then becomes the target for improvement in the next steps toward parallelism.
Armed with the information gathered on your application and the targeted platforms, it is time to put some details into a plan. Figure 2.5 shows parts of this step. With the effort that’s required in parallelism, it is sensible to research prior work before starting the implementation step.
Figure 2.5 The planning steps lay the foundation for a successful project.
It is likely that similar problems were encountered in the past. You’ll find many research articles on parallelism projects and techniques published in recent years. But one of the richest sources of information includes the benchmarks and mini-apps that have been released. With mini-apps, you have not only the research but also the actual code to study.
The high performance computing community has developed many benchmarks, kernels, and sample applications for use in benchmarking systems, performance experiments, and algorithm development. We’ll list some of these in section 17.4. You can use benchmarks to help select the most appropriate hardware for your application, and mini-apps provide help on the best algorithms and coding techniques.
Benchmarks are intended to highlight a specific characteristic of hardware performance. Now that you have a sense of the performance limit of your application, you should look at the benchmarks most applicable to your situation. If you compute on large arrays that are accessed in a linear fashion, then the STREAM benchmark is appropriate. If you have an iterative matrix solver as your kernel, then the High Performance Conjugate Gradient (HPCG) benchmark might be better. Mini-apps are more focused on a typical operation or pattern found in a class of scientific applications.
It is worthwhile to see if any of these benchmarks or mini-apps are similar to the parallel application you are developing. If so, studying how these do similar operations can save a lot of effort. Often, a lot of work has been done with the code to explore how to get the best performance, to port to other parallel languages and platforms, or to quantify performance characteristics.
Currently, benchmarks and mini-apps are predominantly from the field of scientific computing. We’ll use some of these in our examples, and you are encouraged to use these for your experimentation and as example code. Many of the key operations and parallel implementations are demonstrated in these examples.
The design of data structures has a far-reaching impact on your application. This is one of the decisions that needs to be made up front, realizing that changing the design later becomes difficult. In chapter 4, we go through some of the important considerations, along with a case study that demonstrates the analysis of the performance of different data structures.
To begin, focus on the data and data movement. This is the dominant consideration with today’s hardware platforms. It also leads into an effective parallel implementation where the careful movement of data becomes even more important. If we consider the filesystem and network as well, data movement dominates everything.
At this point, you should evaluate the algorithms in your application. Can they be modified for parallel coding? Are there algorithms with better scalability? For example, your application may have a section of code that takes only 5% of the run time but has N² algorithmic scaling, while the rest of the code scales with N, where N is the number of cells or some other data component. As the problem size grows, the 5% soon becomes 20% and then even higher; soon it dominates the run time. To identify these kinds of issues, you might want to profile a larger problem and then look at the growth in the run time rather than the absolute percentage.
This is the step I think of as hand-to-hand combat. Down in the trenches, line by line, loop by loop, and routine by routine, the code is transformed into parallel code. This is where all your knowledge of parallel implementations on CPUs and GPUs comes into play. As figure 2.6 shows, this material is covered in much of the rest of the book. The chapters on parallel programming languages, chapters 6-8 for CPUs and chapters 9-13 for GPUs, begin your journey to developing this expertise.
Figure 2.6 The implementation step utilizes parallel languages and skills developed in the rest of the book.
During the implementation step, it is important to keep track of your overall goals. You may or may not have decided on your parallel language at this point. Even if you have, you should be willing to reassess your choice as you get deeper into the implementation. Some of the initial considerations for your choice of direction for the project include
Are your speedup requirements fairly modest? You should explore vectorization and shared memory (OpenMP) parallelism in chapters 6 and 7.
Do you need more memory to scale up? If so, you will want to explore distributed memory parallelism in chapter 8.
Do you need large speedups? Then GPU programming is worth looking into in chapters 9-13.
The key in this implementation step is to break the work down into manageable chunks and divide it among your team members. There is both the exhilaration of getting an order-of-magnitude speedup in a routine and the deflation of realizing that the overall impact is small and that there is still a lot of work to do. Perseverance and teamwork are important in reaching the goal.
The commit step finalizes this part of the work with careful checks to verify that code quality and portability are maintained. Figure 2.7 shows the components of this step. How extensive these checks are is highly dependent on the nature of the application. For production applications with many users, the tests need to be far more thorough.
Note At this point, it is easier to catch relatively small-scale problems than it is to debug complications six days into a run on a thousand processors.
Figure 2.7 The goal of the commit step is to create a solid rung on the ladder to reaching your end goal.
The team must buy in to the commit process and work together to follow it. It is suggested that there be a team meeting to develop the procedures for all to follow. The processes used during the initial efforts to improve code quality and portability can be exploited in creating your procedures. Lastly, the commit process should be re-evaluated periodically and adapted to current project needs.
In this chapter, we have only brushed the surface of how to approach a new project and what the available tools can do. For more information, explore the resources and try some of the exercises in the following sections.
Additional expertise with today’s distributed version control tools benefits your project. At least one member of your team should research the many resources on the web that discuss how to use your chosen version control system. If you use Git, the following books from Manning are good resources:
Testing is vitally important in the parallel development workflow. Unit testing is perhaps the most valuable but also the most difficult to implement well. Manning has a book that gives a much more thorough discussion of unit testing:
Vladimir Khorikov, Unit Testing Principles, Practices, and Patterns (Manning, 2020).
Floating-point arithmetic and precision is an underappreciated topic, despite its importance to every computational scientist. The following is a good read and overview on floating-point arithmetic:
David Goldberg, “What every computer scientist should know about floating-point arithmetic,” ACM Computing Surveys (CSUR) 23, no. 1 (1991): 5-48.
You have a wave height simulation application that you developed during graduate school. It is a serial application and because it was only planned to be the basis for your dissertation, you didn’t incorporate any software engineering techniques. Now you plan to use it as the starting point for an available tool that many researchers can use. You have three other developers on your team. What would you include in your project plan for this?
This chapter has covered a lot of ground with many of the details necessary for a parallel project plan. The estimation of performance capabilities and uses of tools to extract information on hardware characteristics and application performance give solid, concrete data points to populate the plan. The proper use of these tools and skills can help build a solid foundation for a successful parallel project.
Code preparation is a significant part of parallelism work. Every developer is surprised at the amount of effort spent preparing the code for the project. But this time is well spent in that it is the foundation for a successful parallelism project.
You should improve your code quality for parallel code. Code quality must be an order of magnitude better than for typical serial code. Part of this need for quality lies in the difficulty of debugging at scale, and part is due to flaws that are exposed by the parallelization process or simply by the sheer number of times each line of code is executed. The probability of encountering a given flaw may be quite small, but when a thousand processors are running the code, it becomes a thousand times more likely to occur.
The profiling step is important to determine where to focus optimization and parallelism work. Chapter 3 provides more details on how to profile your application.
There is an overall project plan and a separate plan for each iteration of development. Both of these plans should include some research into mini-apps, data structure designs, and new parallel algorithms to lay the foundation for the next steps.
With the commit step, we need to develop processes to maintain good code quality. This should be an ongoing effort, not one pushed off until the code is put into production or the existing user base starts encountering problems with large, long-running simulations.
Programmer resources are scarce. You need to target these resources so that they have the most impact. How do you do this if you don’t know the performance characteristics of your application and of the hardware you plan to run on? That is what this chapter aims to address. By measuring the performance of your hardware and your application, you can determine where it’s most effective to spend your development time.
Note We encourage you to follow along with the exercises for this chapter. The exercises can be found at https://github.com/EssentialsofParallelComputing/Chapter3.
Computational scientists still consider floating-point operations (flops) the primary performance limit. While this might have been true years ago, the reality is that flops seldom limit performance on modern architectures. Instead, the limits are bandwidth and latency. Bandwidth is the best rate at which data can be moved through a given path in the system. For bandwidth to be the limit, the code should use a streaming approach, where the memory usually needs to be contiguous and all the values used. When a streaming approach is not possible, latency is the more appropriate limit. Latency is the time required for the first byte or word of data to be transferred. The following shows some of the possible hardware performance limits:
We can break all of these limitations down into two major categories: speeds and feeds. Speeds are how fast operations can be done; they include all types of computer operations. But to be able to do the operations, you must get the data there. This is where feeds come in. Feeds include the memory bandwidth through the cache hierarchy, as well as network and disk bandwidth. For applications that cannot achieve streaming behavior, the latency of the memory, network, and disk feeds is more important. Latency times can be orders of magnitude slower than those for bandwidth. One of the biggest factors in whether an application is controlled by latency limits or by streaming bandwidth is the quality of the programming. Organizing your data so that it can be consumed in a streaming pattern can yield dramatic speedups.
The relative performance of different hardware components is shown in figure 3.1. Let’s use the 1 word loaded per cycle and 1 flop per cycle marked by the large dot as our starting point. Most scalar arithmetic operations like addition, subtraction, and multiplication, can be done in 1 cycle. The division operation can take longer at 3-5 cycles. In some arithmetic mixes, 2 flops/cycle are possible with the fused multiply-add instruction. The number of arithmetic operations that can be done increases further with vector units and multi-core processors. Hardware advances, mostly through parallelism, greatly increase the flops/cycle.
Figure 3.1 Feeds and speeds shown on a roofline plot. The conventional scalar CPU is close to the 1 word loaded per cycle and 1 flop per cycle indicated by the shaded circle. The multipliers for the increase in flops are due to the fused multiply-add instruction, vectorization, multiple cores, and hyperthreads. The relative speeds of memory movement are also shown. We’ll discuss the roofline plot more in section 3.2.4.
Looking at the sloped memory limits, we see that the performance increase through a deeper hierarchy of caches means that memory accesses can only match the speedup of operations if the data is contained in the L1 cache, typically about 32 KiB. But if we only have that much data, we wouldn’t be so worried about the time it takes. We really want to operate on large amounts of data that can only be contained in main memory (DRAM) or even on the disk or network. The net result is that the floating-point capabilities of processors have increased far faster than memory bandwidth. This has led to many machine balances of the order of 50 flops capability for every 8-byte word loaded. To understand this impact on applications, we measure its arithmetic intensity.
Arithmetic intensity—In an application, measures the number of flops executed per memory operations, where memory operations can be either in bytes or words (a word is 8 bytes for a double and 4 bytes for a single-precision value).
Machine balance—Indicates for computing hardware the total number of flops that can be executed divided by the memory bandwidth.
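As a concrete example of the first definition, here is a sketch of the arithmetic intensity of the STREAM triad kernel, a[i] = b[i] + s*c[i] (counting only the three main-memory operations per element; write-allocate traffic is ignored for simplicity):

```python
# STREAM triad: a[i] = b[i] + s * c[i]
flops_per_element = 2            # one multiply plus one add
words_per_element = 3            # load b[i], load c[i], store a[i]
bytes_per_word = 8               # double precision

ai_words = flops_per_element / words_per_element
ai_bytes = flops_per_element / (words_per_element * bytes_per_word)
print(f"arithmetic intensity: {ai_words:.3f} flops/word = {ai_bytes:.4f} flops/byte")
```

At roughly 0.67 flops per word, this streaming kernel sits close to the 1 flop per word typical of most applications, far below the 62.5 flops/word of the Linpack benchmark discussed next.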
Most applications have an arithmetic intensity close to 1 flop per word loaded, but there are also applications with a much higher arithmetic intensity. The classic example of a high arithmetic intensity application uses a dense matrix solver to solve a system of equations. These solvers used to be far more common in applications than they are today. The Linpack benchmark uses the kernel from this operation to represent this class of applications. The arithmetic intensity for this benchmark is reported by Peise to be 62.5 flops/word (see reference in appendix A, Peise, 2017, pg. 201). This is sufficient on most systems to max out the floating-point capability. The heavy use of the Linpack benchmark for the TOP500 ranking of the largest computing systems has become a leading reason why current machine designs target a high flop-to-memory-load ratio.
For many applications, even achieving the memory bandwidth limit can be difficult. Some understanding of the memory hierarchy and architecture is necessary to understand memory bandwidth. Multiple caches between memory and the CPU help hide the slower main memory (figure 3.5 in section 3.2.3) in the memory hierarchy. Data is transported up the memory hierarchy in chunks called cache lines. If memory is not accessed in a contiguous, predictable fashion, the full memory bandwidth is not achieved. Merely accessing data in columns for a 2D data structure that is stored in row order will stride across memory by the row length. This can result in as little as one value being used out of each cache line. A rough estimate of the memory bandwidth for this data access pattern is 1/8th of the stream bandwidth (1 out of every 8 cached values used). This can be generalized to other cases where more of the cache is used by defining a non-contiguous bandwidth (Bnc) in terms of the percentage of cache used (Ucache) and the empirical bandwidth (BE):
Bnc = Ucache × BE = Average Percentage of Cache Used × Empirical Bandwidth
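A quick numerical sketch of this formula, assuming 8-byte doubles in a 64-byte cache line and the roughly 22 GiB/s empirical bandwidth measured later in this chapter:

```python
def noncontiguous_bandwidth(empirical_bw, cache_fraction_used):
    """B_nc = U_cache x B_E: effective bandwidth when only part of each cache line is used."""
    return cache_fraction_used * empirical_bw

b_e = 22.0          # GiB/s, empirical stream bandwidth (measured, not theoretical)
u_cache = 1 / 8     # column access: 1 of the 8 doubles per cache line is used

print(f"B_nc = {noncontiguous_bandwidth(b_e, u_cache):.2f} GiB/s")
```

With only one of eight values per cache line used, the effective bandwidth collapses from 22 GiB/s to under 3 GiB/s, which is why data layout matters so much.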
There are other possible performance limits. The instruction cache may not be able to load instructions fast enough to keep a processor core busy. Integer operations are also a more frequent limiter than commonly assumed, especially with higher dimensional arrays where the index calculations become more complex.
For applications that require significant network or disk operations (such as big data, distributed computing, or message passing), network and disk hardware limits can be the most serious concern. To get an idea of the magnitude of these device performance limitations, consider the rule of thumb that for the time taken for the first byte transferred over a high performance computer network, you can do over 1,000 flops on a single processor core. Standard mechanical disk systems are orders of magnitude slower for the first byte, which has led to the highly asynchronous, buffered operation of today’s filesystems and to the introduction of solid-state storage devices.
Once you have prepared your application and your test suites, you can begin characterizing the hardware that you are targeting for production runs. To do this, you need to develop a conceptual model for the hardware that allows you to understand its performance. Performance can be characterized by a number of metrics:
The rate at which floating-point operations can be executed (FLOPs/s)
The rate at which data can be moved between various levels of memory (GB/s)
The rate at which energy is used by your application (Watts)
The conceptual models allow you to estimate the theoretical peak performance of various components of the compute hardware. The metrics you work with in these models, and those you aim to optimize, depend on what you and your team value in your application. To complement this conceptual model, you can also make empirical measurements on your target hardware. The empirical measurements are made with micro-benchmark applications. One example of a micro-benchmark is the STREAM Benchmark that is used for bandwidth-limited cases.
In determining hardware performance, we use a mixture of theoretical and empirical measurements. Although complementary, the theoretical value provides an upper bound to performance, and the empirical measurement confirms what can be achieved in a simplified kernel in close to actual operating conditions.
It is surprisingly difficult to get hardware performance specifications. The explosion of processor models and the focus of marketing and media reviews on the broader public often obscure the technical details. Good resources include
For Intel processors, https://ark.intel.com
For AMD processors, https://www.amd.com/en/products/specifications/processors
One of the best tools for understanding the hardware you run is the lstopo program. It is bundled with the hwloc package that comes with nearly every MPI distribution. This command outputs a graphical view of the hardware on your system. Figure 3.2 shows the output for a Mac laptop. The output can be graphical or text-based. To get the picture in figure 3.2 currently requires a custom installation of hwloc and the cairo packages to enable the X11 interface. The text version works with the standard package manager installs. Linux and Unix versions of hwloc usually work as long as you can display an X11 window. A new command, netloc, is being added to the hwloc package to display the network connections.
Figure 3.2 Hardware topology for a Mac laptop using the lstopo command
Download cairo from https://www.cairographics.org/releases/
./configure --with-x --prefix=/usr/local
make
make install
Clone the hwloc package from Git: https://github.com/open-mpi/hwloc.git
./configure --prefix=/usr/local
make
make install
Some other commands for probing hardware details are lscpu on Linux systems, wmic on Windows, and sysctl or system_profiler on Mac. The Linux lscpu command outputs a consolidated report of the information from the /proc/cpuinfo file. You can see the full information for every logical core by viewing /proc/cpuinfo directly. The information from the lscpu command and the /proc/cpuinfo file helps to determine the number of processors, the processor model, the cache sizes, and the clock frequency for the system. The flags contain important information on the vector instruction set for the chip. In figure 3.3, we see that the AVX2 and various forms of the SSE vector instruction set are available. We’ll discuss vector instruction sets more in chapter 6.
Figure 3.3 Output from lscpu for a Linux desktop that shows a 4-core i5-6500 CPU @ 3.2 GHz with AVX2 instructions
Obtaining information on the devices on the PCI bus can be helpful, particularly for identifying the number and type of the graphics processor. The lspci command reports all the devices (figure 3.4). From the output in the figure, we can see that there is one GPU and that it is an NVIDIA GeForce GTX 960.
Figure 3.4 Output from the lspci command from a Linux desktop that shows an NVIDIA GeForce GTX 960 GPU.
Let’s run through the numbers for a mid-2017 MacBook Pro laptop with an Intel Core i7-7920HQ processor. This is a 4-core processor running at a nominal frequency of 3.1 GHz with hyperthreading. With its turbo boost feature, it can run at 3.7 GHz when using four processors and up to 4.1 GHz when using a single processor. The theoretical maximum flops (F T) can be calculated with
FT = Cv × fc × Ic = Virtual Cores × Clock Rate × Flops/Cycle
The number of cores includes the effects of hyperthreads that make the physical cores (C h) appear to be a greater number of virtual or logical cores (C v). Here we have two hyperthreads that make the virtual number of processors appear to be eight. The clock rate is the turbo boost rate when all the processors are engaged. For the processor, it is 3.7 GHz. Finally, the flops per cycle, or more generally instructions per cycle (Ic), includes the number of simultaneous operations that can be executed by the vector unit.
To determine the number of operations that can be performed, we take the vector width (VW) and divide by the word size in bits (Wbits). We also include the fused multiply-add (FMA) instruction as another factor of two operations per cycle. We refer to this as fused operations (Fops) in the equation. For this specific processor, we get
Ic = VW/Wbits × Fops = (256-bit Vector Unit/64 bits) × (2 FMA) = 8 Flops/Cycle
Cv = Ch × HT = (4 Hardware Cores × 2 Hyperthreads) = 8 Virtual Cores
FT = (8 Virtual Cores) × (3.7 GHz) × (8 Flops/Cycle) = 236.8 GFlops/s
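The same calculation as a Python sketch, using the i7-7920HQ numbers from the example above:

```python
hardware_cores = 4
hyperthreads = 2
clock_ghz = 3.7                       # all-core turbo boost rate
vector_bits, word_bits = 256, 64      # AVX2 vector unit, 64-bit doubles
fused_ops = 2                         # fused multiply-add: 2 ops per cycle

virtual_cores = hardware_cores * hyperthreads             # C_v = 8
flops_per_cycle = (vector_bits // word_bits) * fused_ops  # I_c = 8

peak_gflops = virtual_cores * clock_ghz * flops_per_cycle  # F_T
print(f"F_T = {peak_gflops:.1f} GFlops/s")
```

Swapping in the core count, clock rate, and vector width of another processor gives its theoretical peak in the same way.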
For most large computational problems, we can assume that there are large arrays that need to be loaded from main memory through the cache hierarchy (figure 3.5). The memory hierarchy has grown deeper over the years with the addition of more levels of cache to compensate for the increase in processing speed relative to the main memory access times.
Figure 3.5 Memory hierarchy and access times. Memory is loaded into cache lines and stored at each level of the cache system for reuse.
We can calculate the theoretical memory bandwidth of the main memory using the memory chips specifications. The general formula is
BT = MTR × Mc × Tw × Ns = Data Transfer Rate × Memory Channels × Bytes Per Access × Sockets
Processors are installed in a socket on the motherboard. The motherboard is the main system board of the computer, and the socket is the location where the processor is inserted. Most motherboards are single-socket, where only one processor can be installed. Dual-socket motherboards are more common in high-performance computing systems. Two processors can be installed in a dual-socket motherboard, giving us more processing cores and more memory bandwidth.
The data or memory transfer rate (MTR) is usually given in millions of transfers per second (MT/s). Double data rate (DDR) memory performs transfers at the top and bottom of the cycle, for two transactions per cycle. This means that the memory bus clock rate is half of the transfer rate in MHz. The memory transfer width (Tw) is 64 bits, and because there are 8 bits per byte, 8 bytes are transferred per access. There are two memory channels (Mc) on most desktop and laptop architectures. If you install memory in both memory channels, you will get better bandwidth, but it also means you cannot simply buy another DRAM module and insert it; you have to replace all the modules with larger ones.
For the 2017 MacBook Pro with LPDDR3-2133 memory and for two channels, the theoretical memory bandwidth (BT) can be calculated from the memory transfer rate (MTR) of 2133 MT/s, the number of channels (Mc), and the number of sockets on the motherboard:
BT = 2133 MT/s × 2 channels × 8 bytes × 1 socket = 34,128 MiB/s or 34.1 GiB/s
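The bandwidth calculation, written out as a sketch with the same LPDDR3-2133 numbers:

```python
transfer_rate = 2133e6    # MTR: 2133 MT/s
channels = 2              # M_c: dual-channel memory
bytes_per_access = 8      # T_w: 64-bit transfer width
sockets = 1               # N_s: single-socket laptop motherboard

bw = transfer_rate * channels * bytes_per_access * sockets  # B_T in bytes/s
print(f"B_T = {bw / 1e9:.1f} GB/s")
```

A dual-socket server with more channels per socket scales this number up proportionally, which is one reason server memory bandwidth is so much higher than a laptop's.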
The achievable memory bandwidth is lower than the theoretical bandwidth due to the effects of the rest of the memory hierarchy. You’ll find complex theoretical models for estimating the effects of the memory hierarchy, but that is beyond what we want to consider in our simplified processor model. For this, we will turn to empirical measurements of bandwidth at the CPU.
The empirical bandwidth is the measurement of the fastest rate that memory can be loaded from main memory into the processor. If a single byte of memory is requested, it takes 1 cycle to retrieve it from a CPU register. If it is not in the CPU register, it comes from the L1 cache. If it is not in the L1 cache, the L1 cache loads it from L2 and so on to main memory. If it goes all the way to main memory, for a single byte of memory, it can take around 400 clock cycles. This time required for the first byte of data from each level of memory is called the memory latency. Once the value is in a higher cache level, it can be retrieved faster until it gets evicted from that level of the cache. If all memory has to be loaded a byte at a time, this would be painfully slow. So when a byte of memory is loaded, a whole chunk of data (called a cache line) is loaded at the same time. If nearby values are subsequently accessed, these are then already in the higher cache levels.
The cache lines, cache sizes, and number of cache levels are sized to try to provide as much of the theoretical bandwidth of the main memory as possible. If we load contiguous data as fast as possible to make the best use of the caches, we get the CPU’s maximum possible data transfer rate. This maximum data transfer rate is called the memory bandwidth. To determine the memory bandwidth, we can measure the time for reading and writing a large array. From the following empirical measurements, the measured bandwidth is about 22 GiB/s. This measured bandwidth is what we’ll use in the simple performance models in the next chapter.
Two different methods are used for measuring the bandwidth: the STREAM Benchmark and the roofline model measured by the Empirical Roofline Toolkit. The STREAM Benchmark was created by John McCalpin around 1995 to support his argument that memory bandwidth is far more important than the peak floating-point capability. In comparison, the roofline model (see the figure in the sidebar entitled “Measuring bandwidth using the empirical Roofline Toolkit” and the discussion later in this section) integrates both the memory bandwidth limit and the peak flop rate into a single plot with regions that show each performance limit. The Empirical Roofline Toolkit was created by Lawrence Berkeley National Laboratory to measure and plot the roofline model.
The STREAM Benchmark measures the time to read and write a large array. For this, there are four variants, depending on the operations performed on the data by the CPU as it is being read: the copy, scale, add, and triad measurements. The copy does no floating-point work, the scale and add do one arithmetic operation, and the triad does two. These each give a slightly different measure of the maximum rate that data can be expected to be loaded from main memory when each data value is only used once. In this regime, the flop rate is limited by how fast memory can be loaded.
The following exercise shows how to use the STREAM Benchmark to measure bandwidth on a given CPU.
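The real STREAM Benchmark is compiled C; purely to illustrate the bookkeeping (bytes moved divided by elapsed time), here is a toy triad in Python. Interpreter overhead dominates, so the reported rate is far below the hardware's true bandwidth:

```python
import time

n = 1_000_000
s = 3.0
b = [1.0] * n
c = [2.0] * n

start = time.perf_counter()
a = [b[i] + s * c[i] for i in range(n)]   # triad: a[i] = b[i] + s*c[i]
elapsed = time.perf_counter() - start

bytes_moved = 3 * 8 * n   # load b, load c, store a: 8 bytes each per element
print(f"triad: {bytes_moved / elapsed / 1e9:.3f} GB/s (interpreter-limited)")
```

The compiled benchmark applies the same bytes-over-time accounting to the copy, scale, add, and triad kernels and reports the best of several repetitions.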
If a calculation can reuse the data in cache, much higher flop rates are possible. If we assume that all data being operated on is in a CPU register or maybe the L1 cache, then the maximum flop rate is determined by the CPU’s clock frequency and how many flops it can do per cycle. This is the theoretical maximum flop rate calculated in the preceding example.
Now we can put these two together to create a plot of the roofline model. The roofline model has a vertical axis of flops per second and a horizontal axis of arithmetic intensity. For high arithmetic intensity, where there are many flops compared to the data loaded, the theoretical maximum flop rate is the limit. This produces a horizontal line on the plot at the maximum flop rate. As the arithmetic intensity decreases, the time for the memory loads starts to dominate, and we can no longer reach the theoretical maximum flop rate. This creates the sloped roof in the roofline model, where the achievable flop rate falls as the arithmetic intensity drops. The horizontal line on the right of the plot and the sloped line on the left produce the characteristic shape reminiscent of a roofline, which has become known as the roofline model or plot. You can determine the roofline plot for a CPU or even a GPU as shown in the following exercise.
现在我们可以确定机器的平衡。机器余额是 flops 除以内存带宽。我们可以计算理论机器余额 (MBT) 和经验机器余额 (MBE),如下所示:
Now we can determine the machine balance. The machine balance is the flops divided by the memory bandwidth. We can calculate both a theoretical machine balance (MBT) and an empirical machine balance (MBE) like so:
MBT = FT / BT = 236.8 GFlops/s / 34.1 GiB/s × (8 bytes/word) = 56 Flops/word
MBE = FE / BE = 264.4 GFlops/s / 22 GiB/s × (8 bytes/word) = 96 Flops/word
In the roofline figure in the previous section, the machine balance is the intersection of the DRAM bandwidth line with the horizontal flop limit line. We see that intersection is just above 10 Flops/Byte. Multiplying by 8 would give a machine balance above 80 Flops/word. We get a few different estimates of the machine balance from these different methods, but the conclusion for most applications is that we are in the bandwidth-bound regime.
Now that you have some sense of what performance you can get from the hardware, you need to determine the performance characteristics of your application. Additionally, you should develop an understanding of how its different subroutines and functions depend on each other.
We’ll focus on profiling tools that produce a high-level view and that also provide additional information or context. There are a lot of profiling tools, but many produce more information than can be absorbed. As time permits, you may want to explore the other profiling tools listed in section 17.3. We’ll also present a mix of freely available tools and commercial tools so that you have options depending on your available resources.
It is important to remember that your goal here is to isolate where it is best to spend your time parallelizing your application. The goal is not to understand every last detail of your current performance. It is easy to make the mistake of either not using these tools at all or getting lost in the tools and the data they produce.
Using call graphs for hot-spot and dependency analysis
We'll start with tools that highlight hot spots and graphically display how each subroutine relates to the others within the code. Hot spots are kernels that occupy the largest amount of time during execution. A call graph is a diagram that shows which routines call other routines. We can merge these two sets of information for an even more powerful combination, as we'll see in the next exercise.
A number of tools can generate call graphs, including Valgrind's callgrind tool. Callgrind's call graphs highlight hot spots and display subroutine dependencies. This type of graph is useful for planning development activities to avoid merge conflicts. A common strategy is to segregate tasks among the team so that the work done by each team member takes place in a single call stack. The following exercise shows how to produce a call graph with the Valgrind tool suite and Callgrind. A companion visualizer, either KCacheGrind or QCacheGrind, then displays the results. The only difference between the two is that one uses X11 graphics and the other uses Qt graphics.
Another useful profiling tool is Intel® Advisor. This is a commercial tool with helpful features for getting the most performance from your application. Intel Advisor is part of the Parallel Studio package, which also bundles the Intel compilers, Intel Inspector, and VTune. Student, educator, open source developer, and trial license options are available at https://software.intel.com/en-us/qualify-for-free-software/student. These Intel tools have also been released for free in the oneAPI package at https://software.intel.com/en-us/oneapi. Recently, Intel Advisor added a profiling feature incorporating the roofline model. Let's take a look at it in operation.
We can also use the freely available likwid tool suite to get an arithmetic intensity. Likwid is an acronym for “Like I Knew What I'm Doing” and is authored by Treibig, Hager, and Wellein at the University of Erlangen-Nuremberg. It is a command-line tool that only runs on Linux and utilizes the model-specific registers (MSRs). The MSR module must be enabled with modprobe msr. The tool uses hardware counters to measure and report various information from the system, including run time, clock frequency, energy and power usage, and memory read and write statistics.
We can also use the output from likwid to calculate the energy reduction for CloverLeaf due to running in parallel.
Instrument specific sections of code with likwid-perfctr markers
Markers can be used in likwid to get performance for one or multiple sections of code. This capability will be used in the next chapter in section 4.2.
Listing 3.1 Inserting markers into code to instrument specific sections of code
LIKWID_MARKER_INIT;                 ❶
LIKWID_MARKER_THREADINIT;
LIKWID_MARKER_REGISTER("Compute");  ❷
LIKWID_MARKER_START("Compute");
// ... Your code to measure
LIKWID_MARKER_STOP("Compute");
LIKWID_MARKER_CLOSE;                ❶
❷ Requires daemon with suid (root) permissions
Generating your own roofline plots
Charlene Yang of NERSC created and released a Python script for generating a roofline plot. This is extremely convenient for generating a high-quality, custom graphic with data from your explorations. For these examples, you may want to install the anaconda3 package. It contains the matplotlib library and Jupyter notebook support. Use the following commands to customize a roofline plot using Python and matplotlib:
git clone https://github.com/cyanguwa/nersc-roofline.git
cd nersc-roofline/Plotting
# modify data.txt with your measured values
python plot_roofline.py data.txt
We’ll use modified versions of this plotting script in a couple of exercises. In this first one, we embedded parts of the roofline plotting script into a Jupyter notebook. Jupyter notebooks (https://jupyter.org/install.html) allow you to intersperse Markdown documentation with Python code for an interactive experience. We use this to dynamically calculate the theoretical hardware performance and then create a roofline plot of your arithmetic intensity and performance.
Plotting this arithmetic intensity and computation rate gives the result in figure 3.6. Both the serial and the parallel runs are plotted on the roofline. The parallel run is about 15 times faster and with slightly higher operational (arithmetic) intensity.
Figure 3.6 Overall performance of CloverLeaf on a Skylake Gold processor
There are a couple more tools that can measure arithmetic intensity. The Intel® Software Development Emulator (SDE) package (https://software.intel.com/en-us/articles/intel-software-development-emulator) generates lots of information that can be used to calculate arithmetic intensity. The Intel® VTune™ performance tool (part of the Parallel Studio package) can also be used to gather performance information.
When we compare the results from Intel Advisor and likwid, there is a difference in the arithmetic intensity. There are many different ways to count operations, counting the whole cache line when loaded or just the data used. Similarly, the counters can count the entire vector width and not just the part that is used. Some tools count just floating-point operations, whereas others count different types of operations (such as integer) as well.
Recent processors have a lot of hardware performance counters and control capabilities. These include processor frequency, temperature, power, and many others. New software applications and libraries are emerging to make accessing this information easier. These applications ease the programming difficulty, but these may also help work around the need for elevated permissions so that the data is more accessible to normal users. This is a welcome development because programmers cannot optimize what they cannot see.
With the aggressive management of processor frequency, processors seldom are at their nominal frequency setting. The clock frequency is reduced when processors are at idle and increased to a turbo-boost mode when busy. Two easy interactive commands to see the behavior of the processor frequency are
watch -n 1 "lscpu | grep MHz"
watch -n 1 "grep MHz /proc/cpuinfo"
The likwid tool suite also has a command-line tool, likwid-powermeter, to look at processor frequencies and power statistics. The likwid-perfctr tool also reports some of these statistics in a summary report. Another handy little app is the Intel® Power Gadget, with versions for the Mac and Windows and a more limited one for Linux. It graphs frequency, power, temperature, and utilization.
The CLAMR mini-app (http://www.github.com/LANL/CLAMR.git) is developing a small library, PowerStats, that will track energy and frequency from within an application and report it at the end of the run. Currently, PowerStats works on the Mac, using the Intel Power Gadget library interface. A similar capability is being developed for Linux systems. The application code needs to add just a few calls as shown in the following listing.
Listing 3.2 PowerStats code to track energy and frequency
powerstats_init();      ❶
powerstats_sample();    ❷
powerstats_finalize();  ❸
❷ Declare periodically during calculation (for example, every 100 iterations) or for different phases
When run, the following table is printed:
Processor Energy(mWh)   = 94.47181
IA Energy(mWh)          = 70.07562
DRAM Energy(mWh)        = 3.09289
Processor Power (W)     = 71.07833
IA Power (W)            = 54.73608
DRAM Power (W)          = 2.32194
Average Frequency       = 3721.19422
Average Temperature (C) = 94.78369
Time Expended (secs)    = 12.13246
Memory usage is another aspect of performance that isn't easily visible to the programmer. You can use the same sort of interactive command as you did for processor frequency, but for memory statistics instead. First, get your process ID from the top or ps command. Then use one of the following commands to track memory usage:
watch -n 1 "grep VmRSS /proc/<pid>/status"
watch -n 1 "ps <pid>"
top -s 1 -p <pid>
To integrate this into your program, perhaps to see what happens with memory in different phases, the MemSTATS library in CLAMR provides four different memory-tracking calls:
long long memstats_memused()
long long memstats_mempeak()
long long memstats_memfree()
long long memstats_memtotal()
Insert these calls into your program to return the current memory statistics at the point of the call. MemSTATS is a single C source and header file, so it should be easy to integrate into your program. To get the source, go to http://github.com/LANL/CLAMR/ and look in the MemSTATS directory. It is also available in the code samples at https://github.com/EssentialsofParallelComputing/Chapter3.
This chapter only scratches the surface of what all these tools can do. For more information, explore the resources in the additional reading section and try some of the exercises.
You can find more information and data on the STREAM Benchmark here:
John McCalpin. 1995. “STREAM: Sustainable Memory Bandwidth in High Performance Computers.” https://www.cs.virginia.edu/stream/.
The roofline model originated at Lawrence Berkeley National Laboratory. Their website has many resources exploring its use:
“Roofline Performance Model.” https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/.
Calculate the theoretical performance of a system of your choice. Include the peak flops, memory bandwidth, and machine balance in your calculation.
Download the Roofline Toolkit from https://bitbucket.org/berkeleylab/cs-roofline-toolkit.git and measure the actual performance of your selected system.
With the Roofline Toolkit, start with one processor and incrementally add optimization and parallelization, recording how much improvement you get at each step.
Download the STREAM Benchmark from https://www.cs.virginia.edu/stream/ and measure the memory bandwidth of your selected system.
Pick one of the publicly available benchmarks or mini-apps listed in section 17.1 and generate a call graph using KCacheGrind.
Pick one of the publicly available benchmarks or mini-apps listed in section 17.1 and measure its arithmetic intensity with either Intel Advisor or the likwid tools.
Using the performance tools presented in this chapter, determine the average processor frequency and energy consumption for a small application.
Using some of the tools from section 3.3.3, determine how much memory an application uses.
This chapter has covered a lot of ground with many necessary details for a parallel project plan. Estimating performance capabilities and using tools to extract information on hardware characteristics and application performance give solid, concrete data points to populate the plan. The proper use of these tools and skills can help build a foundation for a successful parallel project.
There are several possible performance limitations for an application. These range from the peak number of floating-point operations (flops) to memory bandwidth and hard disk reads and writes.
Applications on current computing systems are generally more limited by memory bandwidth than by flops. Although this trend was identified two decades ago, it has become even more pronounced than projected at the time. Yet computational scientists have been slow to adapt their thinking to this new reality.
You can use profiling tools to measure your application performance and to determine where to focus optimization and parallelization work. This chapter shows examples using Intel® Advisor, Valgrind, Callgrind, and likwid, but there are many other tools, including Intel® VTune, Open|Speedshop (O|SS), HPC Toolkit, and Allinea/ARM MAP. (A more complete list is given in section 17.3.) However, the most valuable tools are those that provide actionable information rather than sheer quantity.
You can use hardware performance utilities and apps to determine energy consumption, processor frequency, memory usage, and much more. By making these performance attributes more visible, it becomes easier to optimize for these considerations.
This chapter has two topics that are intimately coupled: (1) the introduction of performance models increasingly dominated by data movement and, thus, necessarily (2) the underlying design and structure of data. Although it may seem secondary to performance, the data structure and its design are critical. This must be determined in advance because it dictates the entire form of the algorithms, code, and later, the parallel implementation.
The choice of data structures and, thereby, the data layout often determines the performance that you can achieve and in ways that are not always obvious when the design decisions are made. Thinking about the data layout and its performance impacts is at the core of a new and growing programming approach called data-oriented design. This approach considers the patterns of how data will be used in the program and proactively designs around it. Data-oriented design gives us a data-centric view of the world, which is also consistent with our focus on memory bandwidth rather than floating-point operations (flops). In summary, for performance, our approach is to think about
Simple performance models based on the data structures and the algorithms that naturally follow can roughly predict performance. A performance model is a simplified representation of how a computer system executes the operations in a kernel of code. We use simplified models because reasoning about the full complexity of the computer operation is difficult and obscures the key aspects we need to think about for performance. These simplified models should capture the computer’s operational aspects that are most important for performance. Also, every computer system varies in the details of its operation. Because we want our application to run on a wide range of systems, we need a model that abstracts a general view of the operations that all systems have in common.
A model helps us to understand the current functioning of our kernel performance. It helps build expectations for the performance and how it might improve with changes to the code. Changes to the code can be a lot of work, and we’ll want to know what the result should be before embarking on the effort. It also helps us to focus on the critical factors and resources for our application’s performance.
A performance model is not limited to flops, and indeed, we will focus on the data and memory aspects. In addition to flops and memory operations, integer operations, instructions, and instruction types can be important and should be counted. But the limits associated with these additional considerations usually track memory performance and can be treated as a small reduction in performance from that limit.
The first part of the chapter looks at simple data structures and how these impact performance. Next, we’ll introduce performance models to use for quickly making design decisions. These performance models are then put to use in a case study to look at more complicated data structures for compressed, sparse multi-material arrays to assess which data structure is likely to perform well. The impact of these decisions on data structures often shows up much later in the project when changes are far more difficult. The last portion of this chapter focuses on advanced programming models; it introduces the more complex models that are appropriate for deeper dives into performance issues or understanding how computer hardware and its design influences performance. Let’s dig into what this means when looking at your code and performance issues.
Note We encourage you to follow along with the examples for this chapter at https://github.com/EssentialsofParallelComputing/Chapter4.
Our goal is to design data structures that lead to good performance. We’ll start with a way to allocate multidimensional arrays and then move on to more complex data structures. To achieve this goal requires
In most modern programming languages, data is grouped in structures of one kind or another. For example, the use of data structures in C or classes in object-oriented programming (also called OOP) bring related items together for the convenience of organizing the source code. The members of the class are gathered together with the methods that operate on it. While the philosophy of object-oriented programming offers a lot of value from a programmer’s perspective, it completely ignores how the CPU operates. Object-oriented programming results in frequent method calls with few lines of code in between (figure 4.1).
For method invocation, the class must first be brought into the cache. Next the data is brought into cache, followed by adjacent elements of the class. This is convenient when you are operating on one object. But applications with intensive computations have large numbers of each item. For these situations, we don't want to invoke a method on one item at a time, with each invocation requiring the traversal of a deep call stack. That leads to instruction cache misses, poor data cache usage, branching, and lots of function call overhead.
C++ methods make it much easier to write concise code, but nearly every line is a method invocation, as figure 4.1 illustrates. In numerical simulation code, the Draw_Line call would more than likely be a complex mathematical expression. But even here, if the Draw_Line function is inlined into the source code, there will be no jumps into functions in the C code. Inlining is where the compiler copies the source from a subroutine into the location where it is used rather than making a call to it. The compiler can only inline simple, short routines, however, and object-oriented code has method calls that won't inline because of complexity and deep call stacks. This causes instruction cache misses and other performance issues. If we are only drawing one window, the loss in performance is offset by the simpler programming. If we are going to draw a million windows, we can't afford the performance hit.
Figure 4.1 Object-oriented languages have deep call stacks with lots of method calls (shown on the left), while procedural languages have long sequences of operations at one level of the call stack.
So let's flip this around and design our data structures for performance rather than programming convenience. Object-oriented programming and other modern programming styles are powerful but introduce many performance traps. At CppCon in 2014, Mike Acton's presentation, “Data-oriented design and C++,” summarized work from the gaming industry that identified why modern programming styles impede performance. Advocates address this issue with a programming style that focuses squarely on performance. This approach, coined data-oriented design, centers on the best data layout for the CPU and the cache. It has much in common with the techniques long used by high-performance computing (HPC) developers. In HPC, data-oriented design is the norm; it follows naturally from the way people wrote programs in Fortran. So, what does data-oriented design look like? It
Operates on arrays, not individual data items, avoiding the call overhead and the instruction and data cache misses
Prefers arrays rather than structures for better cache usage in more situations
Inlines subroutines rather than traversing a deep call hierarchy
Controls memory allocation, avoiding undirected reallocation behind the scenes
Uses contiguous array-based linked lists to avoid the standard linked list implementations used in C and C++, which jump all over memory with poor data locality and cache usage
As we move into parallelization in the next chapters, we’ll note that our experience shows that large data structures or classes also cause problems with shared memory parallelization and vectorization. In shared memory programming, we need to be able to mark variables as private to a thread or as global across all threads. But currently, all the items in the data structure have the same attribute. The problem is particularly acute during incremental introduction of OpenMP parallelization. When implementing vectorization, we want long arrays of homogeneous data, while classes usually group heterogeneous data. This complicates things.
In this section, we’ll cover the ubiquitous multidimensional array data structure in scientific computing. Our goal will be to understand
Handling multidimensional arrays is the most common problem with regard to performance. The first two subfigures in figure 4.2 show the conventional C and Fortran data layouts.
Figure 4.2 Conventional C ordering is row major while Fortran ordering is column major. Switching either the Fortran or C index order makes these compatible. Note that convention has Fortran array indices starting at 1 while C starts at 0. Also, C convention numbers the elements from 0 to 15 in contiguous order.
The C data order is referred to as row major, where data across the row varies faster than data in the column. This means that row data is contiguous in memory. In contrast, the Fortran data layout is column major, where the column data varies fastest. Practically, as programmers, we must remember which index should be in the inner loop to leverage the contiguous memory in each situation (figure 4.3).
Figure 4.3 For C, the important thing to remember is that the last index varies fastest and should be the inner loop of a nested loop. For Fortran, the first index varies fastest and should be the inner loop of a nested loop.
Beyond the differences in data ordering between languages, there is a further issue that must be considered. Is the memory for the whole 2D array contiguous? Fortran doesn’t guarantee that the memory is contiguous unless you use the CONTIGUOUS attribute on the array as this example shows:
real, allocatable, contiguous :: x(:,:)
In practice, using the contiguous attribute is not as critical as it might seem. All popular Fortran compilers allocate contiguous memory for arrays with or without this attribute. The possible exceptions are padding for cache performance or passing an array through a subroutine interface with a slice operator. A slice operator is a Fortran construct that refers to a subset of an array; for example, y(:) = x(1,:) copies a row of a 2D array into a 1D array. Slice operators can also be used in a subroutine call; for example,
call write_data_row(x(1,:))
Some research compilers handle this by simply modifying the stride between data elements in the dope vector for the array. In Fortran, the dope vector is the metadata for the array containing the start location, length of the array, and the stride between elements for each dimension. Dope in this context is from “give me the dope (info)” on someone or something (in this case, the array). Figure 4.4 illustrates the concepts of a dope vector, the slice operator, and stride. The idea is that by modifying the stride in the dope vector from 1 to 4, the data is then traversed as a row rather than a column. But in practice, production Fortran compilers usually make a copy of the data and pass it into the subroutine to avoid breaking code that is expecting contiguous data. This also means that you should avoid using the slice operator in calling Fortran subroutines because of the hidden copy and its resulting performance cost.
Figure 4.4 Different views of a Fortran array created by modifying the dope vector, a set of metadata describing the start, stride, and length in each dimension. The slice operator returns a section of a Fortran array with all of the elements in the dimension with the colon (:). More complicated sections can be created, such as the lower four elements with A(1:2,1:2), where the upper and lower bounds are specified with the colon.
C has its own issues with contiguous memory for a 2D array. This is due to the conventional way of dynamically allocating a 2D array in C as shown in the following listing.
Listing 4.1 Conventional way of allocating a 2D array in C
8 double **x =
(double **)malloc(jmax*sizeof(double *)); ❶
9
10 for (j=0; j<jmax; j++){
11 x[j] =
(double *)malloc(imax*sizeof(double)); ❷
12 }
13
14 // computation
15
16 for (j=0; j<jmax; j++){
17 free(x[j]); ❸
18 }
19 free(x); ❸
❶ Allocates a column of pointers of type pointer to double
❷ Allocates the memory for a row of data for each row pointer
❸ Frees each row and then the array of row pointers
This listing uses jmax+1 allocations, and each allocation can come from a different place in the heap. With larger-sized 2D arrays, the layout of the data in memory has only a small impact on cache efficiency. The bigger problem is that the use of noncontiguous arrays is severely limited; it’s impossible to pass these arrays to Fortran, write them as a single block to a file, or pass them to a GPU or another processor. Instead, each of these operations must be done row by row. Fortunately, there is an easy way to allocate a contiguous block of memory for C arrays. Why isn’t it standard practice? Because everyone learns the conventional method in listing 4.1 and doesn’t think twice about it. The following listing shows how to allocate a contiguous block of memory for a 2D array.
Listing 4.2 Allocating a contiguous 2D array in C
8 double **x =
9    (double **)malloc(jmax*sizeof(double *)); ❶
10
11 x[0] = (void *)malloc(jmax*imax*sizeof(double)); ❷
12
13 for (int j = 1; j < jmax; j++) {
14    x[j] = x[j-1] + imax; ❸
15 }
16
17 // computation
18
19 free(x[0]); ❹
20 free(x); ❹
❶ Allocates a block of memory for the row pointers
❷ Allocates a block of memory for the 2D array
❸ Assigns the memory location to point to the data block for each row pointer
❹ Frees the data block and then the row pointers
This method not only gives you a contiguous memory block, but it also takes only two memory allocations! We can optimize this even further by bundling the row pointers into the memory block at the start of the contiguous memory allocation on line 11 of listing 4.2, thereby combining the two memory allocations into one (figure 4.5).
Figure 4.5 A contiguous block of memory becomes a 2D array in C.
The following listing shows the implementation of a single contiguous memory allocation for a 2D array in malloc2D.c.
Listing 4.3 Single contiguous memory allocation for a 2D array
malloc2D.c
1 #include <stdlib.h>
2 #include "malloc2D.h"
3
4 double **malloc2D(int jmax, int imax)
5 {
6 double **x = (double **)malloc(jmax*sizeof(double *) +
7 jmax*imax*sizeof(double)); ❶
8
9 x[0] = (double *)x + jmax; ❷
10
11 for (int j = 1; j < jmax; j++) {
12 x[j] = x[j-1] + imax; ❸
13 }
14
15 return(x);
16 }
malloc2D.h
1 #ifndef MALLOC2D_H
2 #define MALLOC2D_H
3 double **malloc2D(int jmax, int imax);
4 #endif
❶ Allocates a block of memory for the row pointers and the 2D array
❷ Assigns the start of the memory block for the 2D array after the row pointers
❸ Assigns the memory location to point to the data block for each row pointer
Now we have only one memory block, including the row pointer array. This should improve memory allocation and cache efficiency. The array can also be indexed as a 1D or a 2D array as shown in listing 4.4. The 1D array reduces the integer address calculation and is easier to vectorize or thread (when we come to that in chapters 6 and 7). The listing also shows a manual 2D index calculation into a 1D array.
Listing 4.4 1D and 2D access of contiguous 2D array
calc2d.c
1 #include "malloc2D.h"
2
3 int main(int argc, char *argv[])
4 {
5 int i, j;
6 int imax=100, jmax=100;
7
8 double **x = (double **)malloc2D(jmax,imax);
9
10 double *x1d=x[0]; ❶
11 for (i = 0; i< imax*jmax; i++){ ❶
12 x1d[i] = 0.0; ❶
13 } ❶
14
15 for (j = 0; j< jmax; j++){ ❷
16 for (i = 0; i< imax; i++){ ❷
17 x[j][i] = 0.0; ❷
18 } ❷
19 } ❷
20
21 for (j = 0; j< jmax; j++){ ❸
22 for (i = 0; i< imax; i++){ ❸
23 x1d[i + imax * j] = 0.0; ❸
24 } ❸
25 } ❸
26 }
❶ 1D access of the contiguous 2D array
❷ 2D access of the contiguous 2D array
❸ Manual 2D index calculation for a 1D array
Fortran programmers take for granted the first-class treatment of multidimensional arrays in the language. Although C and C++ have been around for decades, they still do not have a native multidimensional array built into the language. There are proposals to the C++ standard to add native multidimensional array support (mdspan) in the 2023 revision (see the Hollman et al. reference in appendix A). Until then, the multidimensional array memory allocation and access covered in listings 4.3 and 4.4 are essential.
In this section, we’ll cover the implications of structures and classes on data layout. Our goals are to understand
There are two different ways to organize related data into data collections. These are the Array of Structures (AoS), where the data is collected into a single unit at the lowest level and then an array is made of the structure, or the Structure of Arrays (SoA), where each data array is at the lowest level and then a structure is made of the arrays. A third way, which is a hybrid of these two data structures, is the Array of Structures of Arrays (AoSoA). We will discuss this hybrid data structure in section 4.1.3.
One common example of an AoS is the color values used to draw graphic objects. The following listing shows the red, green, blue (RGB) color system structure in C.
Listing 4.5 Array of Structures (AoS) in C
1 struct RGB {
2 int R; ❶
3 int G; ❶
4 int B; ❶
5 };
6 struct RGB polygon_color[1000]; ❷
❶ Defines a scalar color value
❷ Defines an Array of Structures (AoS)
Listing 4.5 shows an AoS where the data is laid out in memory as in figure 4.6. In the figure, note the blank space at bytes 12, 28, and 44, where the compiler inserts padding to align each structure on a 128-bit (16-byte) boundary. A 64-byte cache line then holds four values of the structure. In line 6, we create the polygon_color array composed of 1,000 of the RGB data structure type. This data layout is reasonable because, generally, the RGB values are used together to draw each polygon.
Figure 4.6 Layout in memory of an RGB color model in an Array of Structures (AoS).
The SoA is an alternative data layout. The following listing shows the C code for this.
Listing 4.6 Structure of Arrays (SoA) in C
1 struct RGB {
2 int *R; ❶
3 int *G; ❶
4 int *B; ❶
5 };
6 struct RGB polygon_color; ❷
7
8 polygon_color.R = (int *)malloc(1000*sizeof(int));
9 polygon_color.G = (int *)malloc(1000*sizeof(int));
10 polygon_color.B = (int *)malloc(1000*sizeof(int));
11
12 free(polygon_color.R);
13 free(polygon_color.G);
14 free(polygon_color.B);
❶ Defines an integer array of a color value
❷ Defines a Structure of Arrays (SoA)
The memory layout has all 1,000 R values in contiguous memory. The G and B color values could follow the R values in memory, but these can also be elsewhere in the heap, depending on where the memory allocator finds space. The heap is a separate region of memory that is used to allocate dynamic memory with the malloc routine or the new operator. We can also use the contiguous memory allocator (listing 4.3) to force the memory to be located together.
Our concern here is performance. Each of these data structures is equally reasonable to use from the programmer’s perspective, but the important questions are how the data structure appears to the CPU and how it affects performance. Let’s look at the performance of these data structures in a couple of different scenarios.
Array of Structures (AoS) performance assessment
In our color example, assume that when the data is read, all three components for a point are accessed together rather than a single R, G, or B value, so the AoS representation works well. For graphics operations, this data layout is commonly used.
Note If the compiler adds padding, it increases the number of memory loads by 25% for the AoS representation, but not all compilers insert this padding. Still, it is worth considering for those compilers that do.
If only one of the RGB values is accessed in a loop, the cache usage would be poor because the loop skips over unneeded values. When this access pattern is vectorized by the compiler, it would need to use a less efficient gather/scatter operation.
Structure of Arrays (SoA) performance assessment
For the SoA layout, the RGB values occupy separate cache lines (figure 4.7). Thus, for small data sizes where all three RGB values are needed, there’s good cache usage. But as the arrays grow larger and more arrays are in use, the cache system struggles, causing performance to suffer. In these cases, the interactions of the data and the cache become too complicated to fully predict the performance.
Figure 4.7 In the Structure of Arrays (SoA) data layout, the pointers are adjacent in memory, pointing to separate contiguous arrays for each color.
Another data layout and access pattern that is often encountered is the use of variables as 3D spatial coordinates in a computational application. The following listing shows the typical C structure definition for this.
Listing 4.7 Spatial coordinates in a C Array of Structures (AoS)
1 struct point {
2 double x, y, z; ❶
3 };
4 struct point cell[1000]; ❷
5 double radius[1000];
6 double density[1000];
7 double density_gradient[1000];
❶ Defines the spatial coordinate of point
❷ Defines an array of point locations
One use of this data structure is to calculate the distance from the origin (radius) as follows:
10 for (int i=0; i < 1000; i++){
11 radius[i] = sqrt(cell[i].x*cell[i].x + cell[i].y*cell[i].y + cell[i].z*cell[i].z);
12 }
The values of x, y, and z are brought in together in one cache line and written out to the radius variable in a second cache line. The cache usage for this case is reasonable. But in a second plausible case, a computational loop might use the x location to calculate a gradient in density in the x-direction like this:
20 for (int i=1; i < 1000; i++){
21 density_gradient[i] = (density[i] - density[i-1])/
(cell[i].x - cell[i-1].x);
22 }
Now the cache access for x skips over the y and z data so that only one-third (or even one-quarter if padded) of the data in the cache is used. Thus, the optimal data layout depends entirely on usage and the particular data access patterns.
In mixed use cases, which are likely to appear in real applications, sometimes the structure variables are used together and sometimes not. Generally, the AoS layout performs better overall on CPUs, while the SoA layout performs better on GPUs. In reported results, there is enough variability that it is worth testing for a particular usage pattern. In the density gradient case, the following listing shows the SoA code.
Listing 4.8 Spatial coordinate Structure of Arrays (SoA)
1 struct point{
2 double *x, *y, *z; ❶
3 };
4 struct point cell; ❷
5 cell.x = (double *)malloc(1000*sizeof(double));
6 cell.y = (double *)malloc(1000*sizeof(double));
7 cell.z = (double *)malloc(1000*sizeof(double));
8 double *radius = (double *)malloc(1000*sizeof(double));
9 double *density = (double *)malloc(1000*sizeof(double));
10 double *density_gradient = (double *)malloc(1000*sizeof(double));
11 // ... initialize data
12
13 for (int i=0; i < 1000; i++){ ❸
14 radius[i] = sqrt(cell.x[i]*cell.x[i] +
cell.y[i]*cell.y[i] +
cell.z[i]*cell.z[i]);
15 }
16
17 for (int i=1; i < 1000; i++){ ❹
18 density_gradient[i] = (density[i] - density[i-1])/
(cell.x[i] - cell.x[i-1]);
19 }
20
21 free(cell.x);
22 free(cell.y);
23 free(cell.z);
24 free(radius);
25 free(density);
26 free(density_gradient);
❶ Defines arrays of spatial locations
❷ Defines structure of cell spatial locations
❸ This loop uses contiguous values of arrays.
❹ This loop uses contiguous values of arrays.
With this data layout, each variable is brought in on a separate cache line, and cache usage will be good for both kernels. But as the number of required data members gets sufficiently large, the cache has difficulty efficiently handling the multitude of memory streams. In a C++ object-oriented implementation, you should be wary of other pitfalls. The next listing presents a cell class with the cell spatial coordinates and the radius as its data components and a method to calculate the radius from x, y, and z.
Listing 4.9 Spatial coordinate class example with C++
1 class Cell{
2 double x;
3 double y;
4 double z;
5 double radius;
6 public:
7 void calc_radius() {
radius = sqrt(x*x + y*y + z*z); ❶
}
8 void big_calc();
9 }
10
11 Cell my_cells[1000]; ❷
12
13 for (int i = 0; i < 1000; i++){
14 my_cells[i].calc_radius();
15 }
16
17 void Cell::big_calc(){
18 radius = sqrt(x*x + y*y + z*z);
19 // ... lots more code, preventing in-lining
20 }
❶ Calculates the radius for a single cell
❷ Defines an array of objects as an array of structs
Running this code results in a couple of instruction cache misses and overhead from subroutine calls for each cell. Instruction cache misses occur when the sequence of instructions jumps and the next instruction is not in the instruction cache. There are two level 1 caches: one for the program data and the second for the processor’s instructions. Subroutine calls require the additional overhead to push the arguments onto the stack before the call and an instruction jump. Once in the routine, the arguments need to be popped off the stack and then, at the end of the routine, there is another instruction jump. In this case, the code is simple enough that the compiler can inline the routine to avoid these costs. But in more complex cases, such as with a big_calc routine, it cannot. Additionally, the cache line pulls in x, y, z, and the radius. The cache helps speed up the load of the position coordinates that actually need to be read. But the radius, which needs to be written, is also in the cache line. If different processors are writing the values for the radius, this could invalidate the cache lines and require other processors to reload the data into their caches.
There are many features of C++ that make programming easier. These should generally be used at a higher level in the code, using the simpler procedural style of C and Fortran where performance counts. In the previous listing, the radius calculation can be done as an array instead of as a single scalar element. The class pointer can be dereferenced once at the start of the routine to avoid repeated dereferencing and possible instruction cache misses. Dereferencing is an operation where the memory address is obtained from the pointer reference so that the cache line is dedicated to the memory data instead of the pointer. Simple hash tables can also use a structure to group the key and value together as the following listing shows.
Listing 4.10 Hash Array of Structures (AoS)
1 struct hash_type {
2 int key;
3 int value;
4 };
5 struct hash_type hash[1000];
The problem with this code is that it reads multiple keys until it finds one that matches and then reads the value for that key. But the key and value are brought into a single cache line, and the value is ignored until the match occurs. It is better to have the keys as one array and the values as another to facilitate a faster search through the keys, as shown in the next listing.
Listing 4.11 Hash Structure of Arrays (SoA)
1 struct hash_type {
2 int *key;
3 int *value;
4 } hash;
5 hash.key = (int *)malloc(1000*sizeof(int));
6 hash.value = (int *)malloc(1000*sizeof(int));
As a final example, take a physics state structure that contains density, 3D momentum, and total energy. The following listing shows this structure.
Listing 4.12 Physics state Array of Structures (AoS)
1 struct phys_state {
2 double density;
3 double momentum[3];
4 double TotEnergy;
5 };
When processing only density, the next four values in cache go unused. Again, it is better to have this as an SoA.
There are cases where hybrid groupings of structures and arrays are effective. The Array of Structures of Arrays (AoSoA) can be used to “tile” the data into vector lengths. Let’s introduce the notation A[len/4]S[3]A[4] to represent this layout. A[4] is an array of four data elements and is the inner, contiguous block of data. S[3] represents the next level of the data structure of three fields. The combination of S[3]A[4] gives the data layout that figure 4.8 shows.
Figure 4.8 An Array of Structures of Arrays (AoSoA) with the last array length matching the vector length of the hardware, here a vector length of four.
We need to repeat the block of 12 data values A[len/4] times to get all the data. If we replace the 4 with a variable, we get
A[len/V]S[3]A[V], where V=4
In C or Fortran, respectively, the array could be dimensioned as
var[len/V][3][V], var(1:V,1:3,1:len/V)
In C++, this would be implemented naturally as the following listing shows.
Listing 4.13 RGB Array of Structures of Arrays (AoSoA)
1 const int V=4; ❶
2 struct SoA_type{
3    int R[V], G[V], B[V];
4 };
5
6 int main(int argc, char *argv[])
7 {
8    int len=1000;
9    struct SoA_type AoSoA[len/V]; ❷
10
11    for (int j=0; j<len/V; j++){ ❸
12       for (int i=0; i<V; i++){ ❹
13          AoSoA[j].R[i] = 0;
14          AoSoA[j].G[i] = 0;
15          AoSoA[j].B[i] = 0;
16       }
17    }
18 }
❶ Sets the vector length
❷ Divides the array length by the vector length
❸ Loops over the number of blocks
❹ Loops over vector length, which should vectorize
By varying V to match the hardware vector length or the GPU work group size, we create a portable data abstraction. In addition, by defining V=1 or V=len, we recover the AoS and SoA data structures, respectively. This data layout then becomes a way to adapt for the hardware and the program’s data use patterns.
There are many details to address about the implementation of this data structure to minimize indexing costs and decide whether to pad the array for performance. The AoSoA data layout has some of the properties of both the AoS and SoA data structures so the performance is generally close to the better of the two as shown in a study by Robert Bird from Los Alamos National Laboratory (figure 4.9).
Figure 4.9 Performance of the Array of Structures of Arrays (AoSoA) generally matches the better of the AoS and SoA performances. The 1, 8, and NP array lengths in the x-axis legend are the values for the last array in the AoSoA. These values mean that the first set reduces to an AoS, the last set reduces to an SoA, and the middle set has a last array length of 8 to match the vector length of the processor.
Cache efficiency dominates the performance of intensive computations. As long as the data is cached, the computation proceeds quickly. When the data is not cached, a cache miss occurs, and the processor must pause and wait for the data to be loaded. The cost of a cache miss is on the order of 100 to 400 cycles; hundreds of flops can be done in the same time! For performance, we must minimize cache misses, and minimizing cache misses requires an understanding of how data moves from main memory to the CPU. This is done with a simple performance model that separates cache misses into three C’s: compulsory, capacity, and conflict. First, we must understand how the cache works.
When data is loaded, it is loaded in blocks called cache lines, typically 64 bytes long. These are then inserted into a cache location based on their address in memory. In a direct-mapped cache, there is only one cache location into which a given address can be loaded. This matters when two arrays map to the same location: with a direct-mapped cache, only one of them can be cached at a time. To avoid this, most processors have an N-way set associative cache that provides N locations into which the data can be loaded. With regular, predictable memory accesses of large arrays, it is possible to prefetch data. That is, an instruction can be issued to preload data before it is needed so that it is already in the cache. This can be done either in hardware or in software by the compiler.
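The mapping from address to cache location can be sketched in a few lines of C. The cache geometry below is a hypothetical but typical L1 cache (32 KiB, 8-way set associative, 64-byte lines), not a configuration from the book:

```c
#include <stdint.h>

/* Hypothetical cache geometry: 32 KiB, 8-way set associative, 64-byte lines */
enum { LINE_BYTES = 64, NUM_WAYS = 8, CACHE_BYTES = 32 * 1024 };
enum { NUM_SETS = CACHE_BYTES / (LINE_BYTES * NUM_WAYS) };   /* 64 sets */

/* Every address maps to exactly one set; the 8 ways are the candidate slots.
   In a direct-mapped cache, NUM_WAYS would be 1 and conflicts more frequent. */
unsigned set_index(uintptr_t addr) {
   return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}
```

Two arrays whose base addresses differ by a multiple of NUM_SETS × LINE_BYTES (4 KiB here) land in the same sets and can conflict with each other.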
Eviction is the removal of a cache line from one or more cache levels. It can be caused by the load of another cache line that maps to the same location (a cache conflict) or by the limited size of the cache (a capacity miss). A store operation from an assignment in a loop causes a write allocate in the cache, where a new cache line is created and modified. This cache line is eventually evicted (stored) to main memory, although that may not happen immediately; the various write policies in use affect the details of write operations. The three C's of caches are a simple approach to understanding the sources of the cache misses that dominate run-time performance for intensive computations.
Compulsory—Cache misses that are necessary to bring in the data when it is first encountered.
Capacity—Cache misses that are caused by a limited cache size, which evicts data from the cache to free up space for new cache line loads.
Conflict—Cache misses that occur when data maps to the same location in the cache. If two or more data items are needed at the same time but map to the same cache location, both must be reloaded repeatedly for each data element access.
When cache misses occur due to capacity or conflict evictions followed by reloads of the cache lines, this is sometimes referred to as cache thrashing, which can lead to poor performance. From these definitions, we can easily calculate a few characteristics of a kernel and at least get an idea of the expected performance. For this, we will use the blur operator kernel from figure 1.10.
Listing 4.14 shows the stencil.c kernel. We also use the 2D contiguous memory allocation routine in malloc2D.c from section 4.1.1. The timer code is not shown here but is in the online source code. Included are timers and calls to the likwid (“Like I Knew What I’m Doing”) profiler. Between iterations, there is a write to a large array to flush the cache so that there is no relevant data in it that can distort the results.
Listing 4.14 Stencil kernel for the Krakatau blur operator
stencil.c
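The listing itself is not reproduced in this excerpt. A minimal sketch consistent with the operation counts used below (5 loads, 5 flops, and 1 store per interior cell of a 2000×2000 mesh with a 1-cell halo) might look like the following; the allocation helper stands in for the book's malloc2D routine:

```c
#include <stdlib.h>

/* Contiguous 2D allocation (stand-in for the book's malloc2D routine) */
double **alloc2d(int jmax, int imax) {
   double **a = malloc(jmax * sizeof(double *));
   a[0] = calloc((size_t)jmax * imax, sizeof(double));
   for (int j = 1; j < jmax; j++) a[j] = a[0] + (size_t)j * imax;
   return a;
}

/* Five-point blur stencil over the interior cells */
void blur(double **x, double **xnew, int imax, int jmax) {
   for (int j = 1; j <= jmax; j++) {
      for (int i = 1; i <= imax; i++) {
         xnew[j][i] = (x[j][i] + x[j][i-1] + x[j][i+1]      /* 5 loads */
                     + x[j-1][i] + x[j+1][i]) / 5.0;        /* 5 flops, 1 store */
      }
   }
}
```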
If we had a perfectly effective cache, once the data is loaded into the cache, it would be kept there. Of course, this is far from reality in most cases. But with this model, we can calculate the following:
Total memory used = 2000 × 2000 × (5 references + 1 store) × 8 bytes = 192 MB
Compulsory memory loaded and stored = 2002 × 2002 × 8 bytes × 2 arrays = 64.1 MB
Arithmetic intensity = 5 flops × 2000 × 2000 / 64.1 MB = 0.312 flops/byte or 2.5 flops/word
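These back-of-the-envelope numbers are easy to check in code; the helper names below are purely illustrative:

```c
/* Perfect-cache model for a 2000x2000 interior with a 1-cell halo
   (2002x2002 allocated), 8-byte doubles, and 2 arrays (read and write) */
double total_memory_mb(void)  { return 2000.0 * 2000.0 * (5 + 1) * 8 / 1.0e6; }
double compulsory_mb(void)    { return 2002.0 * 2002.0 * 8 * 2 / 1.0e6; }

double arithmetic_intensity(void) {   /* flops per byte of compulsory traffic */
   return 5.0 * 2000.0 * 2000.0 / (compulsory_mb() * 1.0e6);
}
```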
The program is then compiled with the likwid library and run on a Skylake 6152 processor with the following command:
likwid-perfctr -C 0 -g MEM_DP -m ./stencil
The result that we need is at the end of the performance table printed at the conclusion of the run:
+-----------------------------------+------------+
|                ...                |            |
|  DP MFLOP/s                       |  3923.4952 |
|  AVX DP MFLOP/s                   |  3923.4891 |
|                ...                |            |
|  Operational intensity            |  0.247     |
+-----------------------------------+------------+
The performance data for the stencil kernel is presented as a roofline plot using a Python script (available in the online materials) and shown in figure 4.10. The roofline plot, as introduced in section 3.2.4, shows the hardware limits of the maximum floating-point operations and the maximum bandwidth as a function of arithmetic intensity.
Figure 4.10 The roofline plot of the stencil kernel for the Krakatau example in chapter 1 shows the compulsory upper bound to the right of the measured performance.
This roofline plot shows the compulsory data limit to the right of the measured arithmetic intensity of 0.247 (shown with a large dot in figure 4.10). The kernel cannot do better than the compulsory limit if it has a cold cache. A cold cache is one that holds no relevant data from whatever operations were done before entering the kernel. The distance between the large dot and the compulsory limit gives us an idea of how effective the cache is in this kernel. The kernel in this case is simple, and the capacity and conflict cache loads are only about 15% greater than the compulsory cache loads, so there is not much room for improvement in the kernel's cache usage. The distance between the large dot and the DRAM roofline arises because this is a serial kernel with vectorization, while the rooflines were measured with OpenMP parallelism. Thus, there is potential to improve performance by adding parallelism.
Because this is a log-log plot, differences are greater than they might appear. Looking closely, the possible improvement from parallelism is nearly an order of magnitude. Improving cache usage can be accomplished by using other values in the cache line or reusing data multiple times while it is in the cache. These are two different cases, referred to as either spatial locality or temporal locality:
Spatial locality refers to data with nearby locations in memory that are often referenced close together.
Temporal locality refers to recently referenced data that is likely to be referenced again in the near future.
For the stencil kernel (listing 4.14), when the value of x[1][1] is brought into cache, x[1][2] is also brought into cache. This is spatial locality. In the next iteration of the loop to calculate x[1][2], x[1][1] is needed. It should still be in the cache and gets reused as a case of temporal locality.
A fourth C, coherency, is often added to the three C's mentioned earlier; it will become important in later chapters.
Definition Coherency applies to those cache updates needed to synchronize the cache between multiprocessors when data that is written to one processor’s cache is also held in another processor’s cache.
The cache updates required to maintain coherency can sometimes lead to heavy traffic on the memory bus and are sometimes referred to as cache update storms. These cache update storms can lead to slowdowns in performance rather than speedups when additional processors are added to a parallel job.
This section looks at an example of using simple performance models to make informed decisions on what data structure to use for multi-material calculations in a physics application. It uses a real case study to show the effects of:
Simple performance models for a real programming design question
Compressed sparse data structures to stretch your computational resources
Some segments of computational science have long used compressed sparse matrix representations. Most notable is the Compressed Sparse Row (CSR) format, used for sparse matrices since the mid-1960s with great results. For the compressed sparse data structure evaluated in this case study, the memory savings are greater than 95%, and the run time is nearly 90% faster than the simple 2D array design. The simple performance models used predicted the performance within a 20-30% error of the actual measured performance (see Fogerty, Martineau, et al., in the section on additional reading later in this chapter). But there is a cost to using this compressed scheme: programmer effort. We want to use the compressed sparse data structure where its benefits outweigh the costs. Making this decision is where the simple performance model really shows its usefulness.
Simple performance models are useful to the application developer when addressing more complex programming problems than just a doubly-nested loop over a 2D array. The goal of these models is to get a rough assessment of performance through simple counts of operations in a characteristic kernel to make decisions on programming alternatives. Simple performance models are slightly more complicated than the three C’s model. The basic process is to count and note the following:
We'll count memory loads and stores (collectively referred to as memops) and flops, but we'll also note whether the memory loads are contiguous and whether there are branches that might affect performance. We'll also use empirical data, such as the stream bandwidth and generalized operation counts, to transform the counts into performance estimates. If the memory loads are not contiguous, only 1 out of 8 values in each cache line is used, so in those cases we divide the stream bandwidth by up to 8.
For the serial part of this study, we’ll use the hardware performance of a MacBook Pro with 6 MB L3 cache. The processor frequency (v) is 2.7 GHz. The measured stream bandwidth is 13,375 MB/s using the technique introduced in section 3.2.4 with the stream benchmark code.
In algorithms with branching, if we take the branch almost all the time, the branch cost is low. When the branch is taken infrequently, we add a branch prediction cost (Bc) and possibly a missed prefetch cost (Pc). A simple model of the branch predictor uses the most frequent case in the last few iterations as the likely path. This lowers the cost if there is some clustering in branch paths due to data locality. The branch penalty (Bp) becomes Nb Bf (Bc + Pc)/v. For typical architectures, the branch prediction cost (Bc) is about 16 cycles, and the missed prefetch cost (Pc) is empirically determined to be about 112 cycles. Nb is the number of times the branch is encountered, and Bf is the branch miss frequency. Loop overhead for small loops of unknown length is also assigned a cost (Lc) to account for branching and control. The loop cost is estimated at about 20 cycles per exit. The loop penalty (Lp) becomes Lc/v.
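The penalty terms are straightforward to encode. Here is a sketch using the rough constants from the text (Bc ≈ 16 cycles, Pc ≈ 112 cycles, Lc ≈ 20 cycles); the function names are illustrative:

```c
/* Branch penalty Bp = Nb * Bf * (Bc + Pc) / v, in seconds */
double branch_penalty_s(double nb, double bf, double freq_hz) {
   const double Bc = 16.0;    /* branch prediction cost, cycles */
   const double Pc = 112.0;   /* missed prefetch cost, cycles */
   return nb * bf * (Bc + Pc) / freq_hz;
}

/* Loop penalty Lp = Lc / v per loop exit, in seconds */
double loop_penalty_s(double freq_hz) {
   const double Lc = 20.0;    /* loop overhead, cycles */
   return Lc / freq_hz;
}
```

For example, a million mispredict-prone branch encounters at a miss frequency of 0.7 on a 2.7 GHz processor cost roughly 33 ms.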
We will use simple performance models in a design study looking at possible multi-material data structures for physics simulations. The purpose of this design study is to determine which data structures would give the best performance before writing any code. In the past, the choice was made on subjective judgement rather than an objective basis. The particular case that is being examined is the sparse case, where there are many materials in the computational mesh but only one or few materials in any computational cell. We’ll reference the small sample mesh with four materials in figure 4.11 in the discussion of possible data layouts. Three of the cells have only a single material, whereas cell 7 has four materials.
Figure 4.11 A 3×3 computational mesh shows that cell 7 contains four materials.
The data structure is only half the story. We also need to evaluate the data layout in a couple of representative kernels by
Computing ρavg[C], the average density of materials in cells of a mesh
Evaluating p[C][m], the pressure in each material contained in each cell using the ideal gas law: p(ρ,t) = nRT/V
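The second kernel can be sketched in C for the full matrix layout. This is an illustrative version only: the gas constant, the per-material arrays, and the use of molar density n/V = ρ are assumptions, not the book's exact formulation:

```c
/* Pressure from the ideal gas law, p = rho*R*t, for each material in each
   cell with the full matrix layout p[C][m] (names and units illustrative) */
void pressure_ideal_gas(int ncells, int nmats,
                        const double *rho,   /* [ncells][nmats] densities */
                        const double *t,     /* [ncells][nmats] temperatures */
                        const double *vf,    /* [ncells][nmats] volume fractions */
                        double *p)           /* [ncells][nmats] pressures out */
{
   const double R = 8.314;                   /* gas constant (illustrative) */
   for (int c = 0; c < ncells; c++) {
      for (int m = 0; m < nmats; m++) {
         int i = c*nmats + m;
         /* only materials actually present in the cell get a pressure */
         p[i] = (vf[i] > 0.0) ? rho[i] * R * t[i] : 0.0;
      }
   }
}
```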
Both of these computations have an arithmetic intensity of 1 flop per word or lower. We also expect that these kernels will be bandwidth limited. We'll use two large data sets to test the performance of the kernels. Both are 50-material (Nm), 1 million-cell problems (Nc) with four state arrays (Nv). The state arrays are density (ρ), temperature (t), pressure (p), and volume fraction (Vf). The two data sets are
Geometric Shapes Problem—A mesh initialized from nested rectangles of materials (figure 4.12). The mesh is a regular rectangular grid. With the materials in separate rectangles rather than scattered, most cells have only one or two materials. The result is that there are 95% pure cells (Pf) and 5% mixed cells (Mf). This mesh has some data locality, so the branch miss frequency (Bf) is roughly estimated to be 0.7.
Figure 4.12 Fifty nested half rectangles used to initialize mesh for the geometric shapes test case
Randomly Initialized Problem—A randomly initialized mesh with 80% pure cells and 20% mixed cells. Because there is little data locality, the branch miss frequency (Bf) is estimated to be 1.0.
In the performance analysis in sections 4.3.1 and 4.3.2, there are two major design considerations: data layout and loop order. We refer to the data layout as either cell- or material-centric, depending on the larger organizing factor in the data. The data layout factor has the large stride in the data order. We refer to the loop access pattern as either cell- or material-dominant to indicate which is the outer loop. The best situation occurs when the data layout is consistent with the loop access pattern. There is no perfect solution; one of the kernels prefers one layout and the second kernel prefers the other.
The simplest data structure is a full matrix storage representation. This assumes that every material is in every cell. These full matrix representations are similar to the 2D arrays discussed in the previous section.
Full matrix cell-centric storage
For the small problem in figure 4.11 (the 3×3 computational mesh), figure 4.13 shows the cell-centric data layout. The data order follows the C language convention, with the materials stored contiguously for each cell. In other words, the programming representation is variable[C][m] with m varying fastest. In the figure, the shaded elements are mixed materials in a cell. Pure cells have just a 1.0 entry. The elements with dashes indicate that none of that material is in the cell, which is therefore given a zero in this representation. In this simple example, about half of the entries are zeros, but in the bigger problem, the number of zero entries will be greater than 95%. The fraction of non-zero entries is referred to as the filled fraction (Ff), and for our design scenario it is typically less than 5%. Thus, if a compressed sparse storage scheme is used, the memory savings will be greater than 95%, even accounting for the additional storage overhead of the more complex data structures.
Figure 4.13 The cell-centric, full matrix data structure with materials stored contiguously for each cell
The full matrix data approach has the advantage that it is simpler and, thus, easier to parallelize and optimize. The memory savings are substantial enough that it is probably worth using the compressed sparse data structure. But what are the performance implications of the method? We can guess that moving more memory for the data increases the memory bandwidth demand and makes the full matrix representation slower. But what if we test the volume fraction and, if it is zero, skip the mixed material access? Figure 4.14 shows how we tested this approach: the pseudo-code for the cell-dominant loop is shown along with the counts for each operation to the left of each line of code. The cell-dominant loop structure has the cell index in the outer loop, which matches the cell index as the first index in the cell-centric data structure.
Figure 4.14 Modified cell-dominant algorithm to compute average density of cells using the full matrix storage
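The figure is not reproduced in this excerpt; a C sketch of the modified cell-dominant loop, with the zero-volume-fraction test that skips the mixed material access (names illustrative), is:

```c
/* Cell-dominant average density over the full matrix layout, stored
   row-major as variable[C][m]. Volume fractions sum to 1 in each cell,
   so the weighted sum is the average density. */
void avg_density_cell_dominant(int ncells, int nmats,
                               const double *rho,    /* [ncells][nmats] */
                               const double *vf,     /* [ncells][nmats] */
                               double *rho_avg)      /* [ncells] */
{
   for (int c = 0; c < ncells; c++) {
      double sum = 0.0;
      for (int m = 0; m < nmats; m++) {        /* Nc*Nm volume fraction loads */
         if (vf[c*nmats + m] > 0.0) {          /* branch, rarely taken */
            sum += rho[c*nmats + m] * vf[c*nmats + m];  /* 2 flops when taken */
         }
      }
      rho_avg[c] = sum;                        /* 1 store per cell */
   }
}
```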
The counts are summarized from the line notes (beginning with #) in figure 4.14 as:
memops = Nc(Nm + 2Ff Nm + 2) = 54.1 Mmemops
flops = Nc(2Ff Nm + 1) = 3.1 Mflops
If we look at flops, we would conclude that we have been efficient and the performance would be great. But this algorithm is clearly going to be dominated by memory bandwidth. For estimating memory bandwidth performance, we need to factor in the branch prediction miss. Because the branch is taken so infrequently, the probability of a branch prediction miss is high. The geometric shapes problem has some locality, so the miss rate is estimated to be 0.7. Putting this all together, we get the following for our performance model (PM):
PM = Nc(Nm + Ff Nm + 2) × 8/Stream + Bp Ff Nc Nm = 67.2 ms

where Bp = Bf (Bc + Pc)/v, with Bf = 0.7, Bc = 16 cycles, Pc = 112 cycles, and v = 2.7 GHz
The cost of the branch prediction miss makes the run time high; higher than if we just skipped the conditional and added in zeros. Longer loops would amortize the penalty cost, but clearly a conditional that is rarely taken is not the best scenario for performance. We could also insert a prefetch operation before the conditional to force loading the data in case the branch is taken. But this would increase the memops so the actual performance improvement would be small. It would also increase the traffic on the memory bus, causing congestion that would trigger other problems, especially when adding thread parallelism.
Full matrix material-centric storage
Now let’s take a look at the material-centric data structure (figure 4.15). The C notation for this is variable[m][C] with the rightmost index of C (or cells) varying fastest. In the figure, the dashes indicate elements that are filled with zeros. Many of the characteristics of this data structure are similar to the cell-centric full matrix data representation, but with the indices of the storage flipped.
Figure 4.15 The material-centric full matrix data structure stores cells contiguously for each material. The array indexing in C would be density[m][C] with the cell index contiguous. The cells with dashes are filled with zeros.
The algorithm for computing the average density of each of the cells can be done with contiguous memory loads and a little thought. The natural way to implement this algorithm is to have the outer loop over the cells, initialize the average to zero there, and divide by the volume at the end. But this strides over the data in a non-contiguous fashion. We instead want to loop over cells in the inner loop, which requires separate loops before and after the main loop. Figure 4.16 shows the algorithm along with annotations for memops and flops.
Figure 4.16 Material-dominant algorithm to compute average density of cells using full matrix storage
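A C sketch of this material-dominant version (names illustrative) keeps the inner loop contiguous over cells by splitting out the initialization:

```c
/* Material-dominant average density over the full matrix layout,
   stored row-major as variable[m][C] so the inner loop is contiguous */
void avg_density_material_dominant(int ncells, int nmats,
                                   const double *rho,   /* [nmats][ncells] */
                                   const double *vf,    /* [nmats][ncells] */
                                   double *rho_avg)     /* [ncells] */
{
   for (int c = 0; c < ncells; c++) rho_avg[c] = 0.0;   /* separate init loop */
   for (int m = 0; m < nmats; m++) {
      for (int c = 0; c < ncells; c++) {                /* contiguous in cells */
         rho_avg[c] += rho[m*ncells + c] * vf[m*ncells + c];
      }
   }
   /* A final loop dividing by cell volume would go here; with volume
      fractions that sum to 1 per cell, the sum is already the average */
}
```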
Collecting all the annotations for operations, we get
memops = 4Nc(Nm + 1) = 204 Mmemops
flops = 2Nc Nm + Nc = 101 Mflops
This kernel is bandwidth limited, so the performance model is
PM = 4Nc(Nm + 1) × 8/Stream = 122 ms
The performance of this kernel is half of what the cell-centric data structure achieved. But this computational kernel favors the cell-centric data layout, and the situation is reversed for the pressure calculation.
Now we’ll discuss the advantages and limitations of a couple of compressed storage representations. The compressed sparse storage data layouts clearly save memory, but the design for both cell- and material-centric layouts takes some thought.
Cell-centric compressed sparse storage
The standard approach is a linked list of materials for each cell. But linked lists are generally short and jump all over memory. The solution is to put the linked list into a contiguous array with the link pointing to the start of the material entries. The next cell will have its materials follow right afterwards. Thus, during normal traversal of the cells and materials, these will be accessed in contiguous order. Figure 4.17 shows the cell-centric data storage scheme. The values for pure cells are kept in cell state arrays. In the figure, 1.0 is the volume fraction of the pure cells, but it can also be the pure cell values for density, temperature, and pressure. The second array is the number of materials in the mixed cell. A -1 indicates that it is a pure cell. Then the material linked list index, imaterial, is in the third array. If it is less than 1, the absolute value of the entry is the index into the mixed data storage arrays. If it is 1 or greater, then it is the index into the compressed pure cell arrays.
Figure 4.17 The mixed material arrays for the cell-centric data structure use a linked list implemented in a contiguous array. The different shading at the bottom indicates the materials that belong to a particular cell and match the shading used in figure 4.13.
Mixed data storage arrays are basically a linked list implemented in a standard array so that the data is contiguous for good cache performance. The mixed data starts with an array called nextfrac, which points to the next material for that cell. This enables the addition of new materials in the cell by adding these to the end of the array. Figure 4.17 shows this with the mixed material list for cell 4, where the arrow shows the third material to be added at the end. The frac2cell array is a backward mapping to the cell that contains the material. The third array, material, contains the material number for the entry. These are the arrays that provide the navigation around the compressed sparse data structure. The fourth array is the set of state arrays for each material in each cell with the volume fraction (Vf), density (ρ), temperature (t) and pressure (p).
The mixed material arrays keep extra memory at the end of the array to quickly add new material entries on the fly. Removing the data link and setting it to zero deletes the materials. To give better cache performance, the arrays are periodically reordered back into contiguous memory.
Figure 4.18 shows the algorithm for the calculation of the average density for each cell for the compressed sparse data layout. We first retrieve the material index, imaterial, to see if this is a cell with mixed materials by testing if it is zero or less. If it is a pure cell, we do nothing because we already have the density in the cell array. If it is a mixed material cell, we enter a loop to sum up the density multiplied by the volume fraction for each of the materials. We test for the end condition of the index becoming negative and use the nextfrac array to get the next entry. Once we reach the end of the list, we calculate the cell’s density (ρ). To the right side of the lines of code are the annotations for the operational costs.
Figure 4.18 Cell-dominant algorithm to compute average cell density using compact storage
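The figure is not reproduced in this excerpt; a C sketch of the traversal over the linked-list-in-array structure (array names follow figure 4.17; the end of each cell's material list is assumed to be marked by a negative nextfrac entry) is:

```c
/* Cell-dominant average density with the cell-centric compressed layout.
   imaterial[c] <= 0: mixed cell; -imaterial[c] indexes the mixed arrays.
   imaterial[c] >= 1: pure cell; density is already in rho_cell[c]. */
void avg_density_compressed(int ncells,
                            const int *imaterial,   /* [ncells] */
                            const int *nextfrac,    /* next entry, or < 0 at end */
                            const double *rho_mix,  /* mixed material densities */
                            const double *vf_mix,   /* mixed volume fractions */
                            double *rho_cell)       /* in/out [ncells] */
{
   for (int c = 0; c < ncells; c++) {
      int ix = imaterial[c];
      if (ix <= 0) {                               /* mixed material cell */
         double sum = 0.0;
         for (int j = -ix; j >= 0; j = nextfrac[j])
            sum += rho_mix[j] * vf_mix[j];         /* 2 flops per entry */
         rho_cell[c] = sum;
      }                                            /* pure cell: nothing to do */
   }
}
```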
For this analysis, we have 4-byte integer loads, so we convert memops to membytes. Collecting the counts, we obtain
membytes = (4 + 2Mf × 8)Nc + (2 × 8 + 4)ML = 6.74 Mbytes, where ML is the number of entries in the mixed material lists
flops = 2ML + Mf Nc = 0.24 Mflops
Again, this algorithm is memory bandwidth limited. The estimated run time from the performance model is a 98% reduction from the full cell-centric matrix.
PM = membytes/Stream + Lp Mf Nc = 0.87 ms
Material-centric compressed sparse storage
The material-centric compressed sparse data structure subdivides everything into separate materials. Returning to the small test problem in figure 4.11, we see that there are six cells with material 1: 0, 1, 3, 4, 6, and 7 (shown in subset 1 of figure 4.19). There are two mappings in the subset: one from mesh to subset (mesh2subset) and one from the subset back to the mesh (subset2mesh). The subset-to-mesh list holds the indices of the six cells. The mesh array contains -1 for each cell that does not have the material and numbers the cells that do sequentially to map to the subset. The nmats array at the top of figure 4.19 holds the number of materials contained in each cell. The volume fraction (Vf) and density (ρ) arrays on the right side of the figure have values for each cell in that material. The C nomenclature for this would be Vf[imat][icell] and ρ[imat][icell]. Because there are relatively few materials with long lists of cells, we can use regular 2D array allocations rather than forcing these to be contiguous. To operate on this data structure, we mostly work with each material subset in sequence.
Figure 4.19 The material-centric compressed sparse data layout is organized around materials. For each material, there is a variable-length array with a list of the cells that contain the material. The shading corresponds to the shading in figure 4.15. The illustration maps between the full mesh and subsets and the volume fraction and density variables for each subset.
Figure 4.20 Material-dominant algorithm computes the average density of cells using the material-centric compact storage scheme.
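The figure is not reproduced in this excerpt; a C sketch of the material-dominant traversal over the subset arrays (names follow figure 4.19) is:

```c
/* Material-dominant average density with the material-centric compressed
   layout. Each material subset stores values only for the cells that
   contain that material, with subset2mesh mapping back to the full mesh. */
void avg_density_matcentric(int ncells, int nmats,
                            const int *ncellsmat,   /* [nmats] subset lengths */
                            int **subset2mesh,      /* [nmats][ncellsmat[m]] */
                            double **rho_mat,       /* per-subset densities */
                            double **vf_mat,        /* per-subset volume fracs */
                            double *rho_avg)        /* [ncells] */
{
   for (int c = 0; c < ncells; c++) rho_avg[c] = 0.0;
   for (int m = 0; m < nmats; m++) {
      for (int j = 0; j < ncellsmat[m]; j++) {      /* only cells with material m */
         int c = subset2mesh[m][j];
         rho_avg[c] += rho_mat[m][j] * vf_mat[m][j];
      }
   }
}
```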
The material-dominant algorithm in figure 4.20 for the compressed sparse algorithm looks like the algorithm in figure 4.13 with the addition of the retrieval of the pointers in lines 5, 6, and 8. But the loads and flops in the inner loop are only done for the material subset of the mesh rather than the full mesh. This provides considerable savings in flops and memops. Collecting all the counts, we get
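A sketch of this material-dominant kernel in C follows, in the spirit of figure 4.20. The names are illustrative, and the trailing per-cell divide by the cell volume V is an assumption, chosen to match the +1 flop per cell in the flop count that follows.

```c
/* Sketch of the material-dominant average-density kernel on the
 * material-centric compressed sparse layout. Vf[m] and rho[m] hold one
 * value per cell in material m's subset; V holds the cell volumes.
 * Names are illustrative, not the book's code. */
void average_density(int ncells, int nmats, const int *nmatcells,
                     int **subset2mesh, double **Vf, double **rho,
                     const double *V, double *density_ave)
{
   for (int ic = 0; ic < ncells; ic++)
      density_ave[ic] = 0.0;

   for (int m = 0; m < nmats; m++) {          /* each material subset    */
      const int *s2m = subset2mesh[m];        /* pointer retrievals      */
      const double *vf = Vf[m];
      const double *den = rho[m];
      for (int c = 0; c < nmatcells[m]; c++)  /* only cells in subset    */
         density_ave[s2m[c]] += den[c]*vf[c];
   }

   for (int ic = 0; ic < ncells; ic++)        /* one flop per cell       */
      density_ave[ic] /= V[ic];
}
```

The inner loop runs only over the Ff NmNc material-cell entries rather than the full NmNc matrix, which is where the savings come from.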
membytes = 5 * 8 * Ff * Nm * Nc + 4 * 8 * Nc + (8 + 4) * Nm = 74 Mbytes
flops = (2 * Ff * Nm + 1) * Nc = 3.1 Mflops
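Plugging sizes into these counts is a one-liner. The sizes used below (Nc = 1,000,000 cells, Nm = 50 materials, filled fraction Ff = 0.021) are assumptions chosen to be consistent with the 74 Mbyte and 3.1 Mflop totals quoted above.

```c
/* Performance-model counts from the two equations above. The problem
 * sizes passed in below are assumptions consistent with the quoted
 * 74 Mbyte and 3.1 Mflop totals. */
double model_membytes(double Ff, double Nm, double Nc)
{
   return 5*8*Ff*Nm*Nc + 4*8*Nc + (8 + 4)*Nm;
}

double model_flops(double Ff, double Nm, double Nc)
{
   return (2*Ff*Nm + 1)*Nc;
}
```

With these inputs, model_membytes(0.021, 50, 1.0e6) evaluates to about 74.0e6 bytes and model_flops(0.021, 50, 1.0e6) to about 3.1e6 flops.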
The performance model shows more than a 95% reduction in estimated run time from the material-centric full matrix data structure:
Table 4.1 summarizes the results for these four data structures. The difference between the estimated and measured run time is remarkably small. This shows that even rough counts of memory loads can be a good predictor of performance.
Table 4.1 The sparse data structures are faster and use less memory than the full 2D matrices.
The advantage of the compressed sparse representations is dramatic, with savings in both memory and performance. Because the kernel we analyzed was more suited for the cell-centric data structures, the cell-centric compressed sparse data structure is clearly the best performer both in memory and run time. If we look at the other kernel that shows the material-centric data layout, the results are slightly in favor of the material-centric data structures. But the big takeaway is that either of the compressed sparse representations is a vast improvement over the full matrix representations.
While this case study focused on multi-material data representations, there are many diverse applications with sparse data that can benefit from the addition of a compressed sparse data structure. A quick performance analysis similar to the one done in this section can determine whether the benefits are worth the additional effort in these applications.
There are more advanced performance models that better capture aspects of the computer hardware. We will briefly cover these advanced models to understand what these offer and the possible lessons to be learned. The details of the performance analysis are not as important as the takeaways.
In this chapter, we focused primarily on bandwidth-limited kernels because these represent the performance limitations of most applications. We counted the bytes loaded and stored by the kernel and estimated the time required for this data movement based on the stream benchmark or roofline model (chapter 3). By now, you should realize that the unit of operation for computer hardware is not really bytes or words but cache lines, and we can improve the performance models by counting the cache lines that need to be loaded and stored. At the same time, we can estimate how much of the cache line is used.
The stream benchmark is actually composed of four individual kernels: the copy, scale, add, and triad kernels. So why the variation in the bandwidth (16156.6-22086.5 MB/s) among these kernels as seen in the STREAM Benchmark exercise in 3.2.4? It was implied then that the cause was the difference in arithmetic intensity among the kernels shown in the table in section 3.2.4. This is only partly true. The small difference in arithmetic operations is really a pretty minor influence as long as we are in the bandwidth-limited regime. The correlation with the arithmetic operations is also not high. Why does the scale operation have the lowest bandwidth? The real culprits are the details in the cache hierarchy of the system. The cache system is not like a pipe with water flowing steadily through it as might be implied by the stream benchmark. It is more like a bucket brigade ferrying data up the cache levels with varying numbers of buckets and sizes as figure 4.21 shows. This is exactly what the Execution Cache Memory (ECM) model developed by Treibig and Hager tries to capture. Although it requires knowledge of the hardware architecture, it can predict the performance extremely well for streaming kernels. Movement between levels can be limited by the number of operations (µops), called micro-ops, that can be performed in a single cycle. The ECM model works in terms of cache lines and cycles, modeling the movement between the different cache levels.
Figure 4.21 The movement of data between cache levels is a series of discrete operations, more like a bucket brigade than a flow through a pipe. The details of the hardware and how many loads can be issued at each level and in each direction largely impact the efficiency of loading data through the cache hierarchy.
Let’s just take a quick look at the ECM model for the stream triad (A[i] = B[i] + s*C[i]) to see how this model works (figure 4.22). This calculation must be done for the specific kernel and hardware. We’ll use a Haswell EP system for the hardware for this analysis. We start at the computational core with the equation Tcore = max(TnOL,TOL), where T is time in cycles. TOL is generally the arithmetic operations that overlap the data transfer time, and TnOL is the non-overlapping data transfer time.
Figure 4.22 The Execution Cache Memory (ECM) model for the Haswell processor provides a detailed timing for the data transfer of the stream triad computation between cache levels. If the data is in main memory, the time it takes to get the data to the CPU is the sum of the transfer times between each cache level or 21.7 + 8 + 5 + 3 = 37.7 cycles. The floating-point operations only take 3 cycles, so the memory loads are the limiting aspect for the stream triad.
For the stream triad, we have a cache line of multiply-add operations. If this is done with a scalar operation, it takes 8 cycles to complete. But we can do this with the new Advanced Vector Extensions (AVX) instructions. The Haswell chip has two fused multiply-add (FMA) AVX 256-bit vector units. Each of these units processes four double-precision values. There are eight values in a cache line, so two FMA AVX vector units can process this in one cycle. TnOL is the data transfer time. We need to load cache lines for B and C, and we need to load and store a cache line for A. This takes 3 cycles for the Haswell chip because of a limitation of the address generation units (AGUs).
Moving the four cache lines from L2 to L1 at 64 bytes/cycle takes 4 cycles. But the use of A[i] is a store operation. A store generally requires a special load called a write-allocate, where the memory space is allocated in the virtual memory system and the cache line is created at the necessary cache levels. Then the data is modified and evicted (stored) from the cache. This can only operate at 32 bytes/cycle at this level of cache, resulting in an additional cycle, for a total of 5 cycles. From L3 to L2, the data transfer is 32 bytes/cycle, so it takes 8 cycles. And finally, using the measured bandwidth of 27.1 GB/s, the number of cycles to move the cache lines from main memory is about 21.7 cycles. ECM uses this special notation to summarize these numbers:
{TOL || TnOL | TL1L2 | TL2L3 | TL3Mem} = {1 || 3 | 5 | 8 | 21.7} cycles
The Tcore is shown by the TOL || TnOL in the notation. These are essentially the times (in cycles) to move between each level, with a special case for the Tcore, where some of the operations on the computational core can overlap some of the data transfer operations from L1 to the registers. Then the model predicts the number of cycles it would take to load from each level of the cache by summing the data transfer time, including the non-overlapping data transfers from L1 to registers. The max of the TOL and the data transfer time is then used as the predicted time:
This special ECM notation shows the resulting prediction for each cache level:

{TECM} = {3 | 8 | 16 | 37.7} cycles
This notation says that the kernel takes 3 cycles when it operates out of the L1 cache, 8 out of L2 cache, 16 out of L3, and 37.7 cycles when the data has to be retrieved from main memory.
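The per-level predictions are just running sums of the transfer terms. Here is a minimal sketch using the Haswell numbers from the text (TnOL = 3, TL1L2 = 5, TL2L3 = 8, TL3Mem = 21.7 cycles), assuming the data transfer time dominates the one-cycle TOL at every level.

```c
/* ECM prediction as a running sum of transfer times. Level 0 means the
 * data starts in L1; level 3 means it starts in main memory. Uses the
 * Haswell cycle counts quoted in the text. */
double ecm_prediction(int level)
{
   const double T[4] = {3.0, 5.0, 8.0, 21.7}; /* TnOL, L1-L2, L2-L3, L3-Mem */
   double cycles = 0.0;
   for (int i = 0; i <= level; i++)
      cycles += T[i];
   return cycles;
}
```

ecm_prediction(0) through ecm_prediction(3) give the 3, 8, 16, and 37.7 cycle predictions quoted above.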
What can be learned from this example is that bumping up against a discrete hardware limit on a particular chip with a particular kernel can force another cycle or two at one of the transfers between cache levels, causing slower performance. A slightly different version of the processor might not have the same problem. For example, later versions of Intel chips add another AGU, which changes the L1-register cycles from 3 to 2.
This example also demonstrates that the vector units have value for both arithmetic operations and data movement. The vector load, also known as a quad-load operation, is not new. Much of the focus in the discussion on vector processors is on the arithmetic operations. But for bandwidth-limited kernels, it is likely that the vector memory operations are more important. An analysis by Stengel, et al. using the ECM model shows that the AVX vector instructions can give a two times performance improvement over loops that the compiler naively schedules. This is perhaps because the compiler does not have enough information available. More recent vector units also implement a gather/scatter memory load operation where the data loaded into the vector unit does not have to be in contiguous memory locations (gather) and the store from the vector to memory does not have to be contiguous memory locations (scatter).
Note This new gather/scatter memory load feature is welcomed as many real numerical simulation codes need it to perform well. But there are still performance issues with the current gather/scatter implementation and more improvement is needed.
We can also analyze the performance of the cache hierarchy with the streaming store. The streaming store bypasses the cache system and writes directly to main memory. There is an option in most compilers to use streaming stores, and some invoke it as an optimization on their own. Its effect is to reduce the number of cache lines being moved between levels of the cache hierarchy, reducing congestion and the slower eviction operation between levels of the cache. Now that you have seen the effect of the cache-line movement, you should be able to appreciate its value.
The ECM model is used to evaluate and optimize stencil kernels by several researchers. Stencil kernels are streaming operations and can be analyzed with these techniques. It gets a little messy to keep track of all the cache lines and hardware characteristics without making mistakes, so performance counting tools can help. We’ll refer you to a couple of references listed in appendix A for further information on these.
The advanced models are great for understanding the performance of relatively simple streaming kernels. Streaming kernels are those that load data in a nearly optimal way to effectively use the cache hierarchy. But kernels in scientific and HPC applications are often complex with conditionals, imperfectly nested loops, reductions, and loop-carried dependencies. In addition, compilers can transform the high-level language to assembler operations in unexpected ways, which complicate the analysis. There are usually a lot of kernels and loops to deal with as well. It is not feasible to analyze these complex kernels without specialized tools, so we try to develop general ideas from the simple kernels that we can apply to the more complex ones.
We can extend our data transfer models for use in analyzing the computer network. A simple network performance model between nodes of a cluster or an HPC system is
Time (ms) = latency (µsec) + bytes_moved (MB) / bandwidth (GB/s) (with unit conversions)
Note that this is a network bandwidth rather than the memory bandwidth we have been using. There is an HPC benchmark site for latency and bandwidth at
http://icl.cs.utk.edu/hpcc/hpcc_results_lat_band.cgi
We can use the network micro-benchmarks from the HPC benchmark site to get typical latency and bandwidth numbers. We’ll use 5 µsecs for the latency and 1 GB/s for the bandwidth. This gives us the plot shown in figure 4.23. For larger messages, we can estimate about 1 ms for every MB transferred. But the vast majority of messages are small. We look at two different communication examples, first a larger message and then a smaller one, to understand the importance of latency and bandwidth in each.
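The model is simple enough to sketch directly. The 5 µsec latency and 1 GB/s bandwidth below are the assumed values from the text.

```c
/* Simple network performance model: time = latency + bytes/bandwidth.
 * Returns seconds; uses the assumed 5 microsecond latency and
 * 1 GB/s bandwidth. */
double network_time(double bytes)
{
   const double latency   = 5.0e-6;  /* 5 microseconds */
   const double bandwidth = 1.0e9;   /* 1 GB/s         */
   return latency + bytes/bandwidth;
}
```

network_time(1.0e6) is about 1.005e-3 s (roughly 1 ms per MB), while network_time(8.0) is about 5.0e-6 s, showing that small messages are dominated by latency.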
Figure 4.23 Typical network transfer time as a function of the size of the message gives us a rule of thumb: 1 GB takes 1 s (second), 1 MB takes 1 ms (millisecond), or 1 KB takes 1 µs (microsecond).
The last sum example is a reduction operation in computer science lingo. An array of cell counts across the processors is reduced to a single value. More generally, a reduction operation is any operation where a multidimensional array of 1 to N dimensions is reduced to at least one dimension smaller, and often to a scalar value. These are common operations in parallel computing and involve cooperation among the processors to complete. Also, the reduction sum in the last example can be performed in pair-wise fashion in a tree-like pattern, with the number of communication hops being log2 N, where N is the number of ranks (processors). When the number of processors reaches into the thousands, the time for the operation grows larger. Perhaps more importantly, all of the processors have to synchronize at the operation, leading to many processors waiting for the others to reach the reduction call.
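A serial stand-in for the pair-wise, tree-like pattern can make the log2 N structure concrete. In a real parallel code each pass would be a round of messages between ranks rather than in-place array additions; this sketch only mimics the pairing.

```c
/* Pair-wise tree reduction: sums n values in ceil(log2(n)) passes.
 * In the pass with a given stride, "rank" i combines its value with
 * that of rank i+stride. */
double tree_reduction_sum(double *v, int n)
{
   for (int stride = 1; stride < n; stride *= 2)   /* log2(n) passes */
      for (int i = 0; i + stride < n; i += 2*stride)
         v[i] += v[i + stride];
   return v[0];   /* rank 0 holds the result */
}
```

Doubling the stride each pass is what gives the log2 N hop count quoted above.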
There are more complex models for network messages that might be useful for specific network hardware. But the details of network hardware vary enough that these may not shed much light on the general behavior across all possible hardware.
Here are some resources for exploring the topics in this chapter, including data-oriented design, data structures, and performance models. Most application developers find the additional materials on data-oriented design to be interesting. Many applications can exploit sparsity, and we can learn how from the case study on compressed sparse data structures.
The following two references give good descriptions of the data-oriented design approach developed in the gaming community for building performance into program design. The second reference also gives the location of the video of Acton’s presentation at CppCon.
Noel Llopis, “Data-oriented design (or why you might be shooting yourself in the foot with OOP)” (December, 2009). Accessed February 21, 2021. http://gamesfromwithin.com/data-oriented-design.
Mike Acton and Insomniac Games, “Data-oriented design and C++.” Presentation at CppCon (September, 2014):
Powerpoint at https://github.com/CppCon/CppCon2014
The following reference is good for going into more detail on the case study of compressed sparse data structures using simple performance models. You’ll also find measured performance results on multi-core and GPUs:
Shane Fogerty, Matt Martineau, et al., “A comparative study of multi-material data structures for computational physics applications.” In Computers & Mathematics with Applications Vol. 78, no. 2 (July, 2019): 565-581. The source code is available at https://github.com/LANL/MultiMatTest.
The following paper introduces the shorthand notation used for the Execution Cache Model:
Holger Stengel, Jan Treibig, et al., “Quantifying performance bottlenecks of stencil computations using the execution-cache-memory model.” In Proceedings of the 29th ACM on International Conference on Supercomputing (ACM, 2015): 207-216.
Write a 2D contiguous memory allocator for a lower-left triangular matrix.
Write a 2D allocator for C that lays out memory the same way as Fortran.
Design a macro for an Array of Structures of Arrays (AoSoA) for the RGB color model in section 4.1.
Modify the code for the cell-centric full matrix data structure to not use a conditional and estimate its performance.
How would an AVX-512 vector unit change the ECM model for the stream triad?
Data structures are at the foundation of application design and often dictate performance and the resulting implementation of parallel code. It is worth a little additional effort to develop a good design for the data layout.
You can use the concepts of data-oriented design to develop higher performing applications.
There are ways to write contiguous memory allocators for multidimensional arrays or special situations to minimize memory usage and improve performance.
You can use compressed storage structures to reduce your application’s memory usage while also improving performance.
Simple performance models based on counting loads and stores can predict the performance of many basic kernels.
More complex performance models shed light on the performance of the cache hierarchy with respect to low-level details in the hardware architecture.
Algorithms are at the core of computational science. Along with data structures, covered in the previous chapter, algorithms form the basis of all computational applications. For this reason, it is important to give careful thought to the key algorithms in your code. To begin, let’s define what we mean by parallel algorithms and parallel patterns.
A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of algorithms include sorting, searching, optimization, and matrix operations.
A parallel pattern is a concurrent, separable fragment of code that occurs in diverse scenarios with some frequency. By themselves, these code fragments generally do not solve complete problems of interest. Some examples include reductions, prefix scans, and ghost cell updates.
We will show the reduction in section 5.7, the prefix scan in section 5.6, and ghost cell updates in section 8.4.2. In one context, a parallel procedure can be considered an algorithm, and in another, it can be a pattern. The real difference is whether it is accomplishing the main goal or just part of a larger context. Recognizing patterns that are “parallel friendly” is important to prepare for later parallelization efforts.
The development of parallel algorithms is a young field. Even the terminology and techniques to analyze parallel algorithms are still stuck in the serial world. One of the more traditional ways to evaluate algorithms is by looking at their algorithmic complexity. Our definition of algorithmic complexity follows.
Definition Algorithmic complexity is a measure of the number of operations that it would take to complete an algorithm. Algorithmic complexity is a property of the algorithm and is a measure of the amount of work or operations in the procedure.
Complexity is usually expressed in asymptotic notation. Asymptotic notation is a type of expression that specifies the limiting bounds of performance. Basically, the notation identifies whether the run time grows linearly or whether it progresses at a more accelerated rate with the problem’s size. The notation uses various forms of the letter O, such as O(N), O(N log N), or O(N²). N is the size of a long array, such as the number of cells, particles, or elements. The combination of O() and N refers to how the cost of the algorithm scales as the size N of the array grows. The O can be thought of as “order,” as in “scales on the order of.” Generally, a simple loop over N items will be O(N), a double-nested loop will be O(N²), and a tree-based algorithm will be O(N log N). By convention, the leading constants are dropped. The most commonly used asymptotic notations are
Big O—This is the worst case limit of an algorithm’s performance. An example is a doubly nested for loop over a large array of size N, which would be O(N²) complexity.
Big Ω (Big Omega)—The best case performance of an algorithm.
Big Θ (Big Theta)—The average case performance of an algorithm.
Traditional analysis of algorithms uses algorithmic complexity, computational complexity, and time complexity interchangeably. We will define the terms a bit differently to help us evaluate algorithms on today’s parallel computing hardware. Time doesn’t scale with the amount of work and neither does the computational effort or cost. We thus make the following adjustments to the definitions for computational complexity and time complexity:
Computational complexity (also called step complexity) is the number of steps that are needed to complete an algorithm. This complexity measurement is an attribute of the implementation and the type of hardware that is used for the calculation. It includes the amount of parallelism that is possible. If you’re using a vector or multi-core computer, a step (cycle) can be four or more floating-point operations. Can you use these additional operations to reduce the number of steps?
Time complexity takes into account the actual cost of an operation on a typical modern computing system. The largest adjustment for time is to consider the cost of memory loads and the caching of data.
We’ll use complexity analysis for some of our algorithm comparisons, such as the prefix sum algorithms in section 5.5. But for applied computer scientists, the asymptotic complexity of an algorithm is somewhat one-dimensional and of limited use. It only tells us the cost of an algorithm in the limit as it grows larger. In an applied setting, we need a more complete model of an algorithm. We’ll see why in the next section.
We first introduced performance models in chapter 4 to analyze the relative performance of different data structures. In a performance model, we build a much more complete description of the performance of an algorithm than in algorithmic complexity analysis. The biggest difference is that we don’t hide the constant multiplier in front of the algorithm. But there is also a difference in the terms such as log N for scaling. The actual count of operations is from a binary tree and should be log2 N.
In traditional algorithmic complexity analysis, the difference between the two logarithmic terms is a constant that gets absorbed into the constant multiplier. In common world problems, these constants could matter and don’t cancel out; therefore, we need to use a performance model to differentiate between different approaches of a similar algorithm. To further understand the benefits of using performance models, let’s start with an example from everyday life.
Returning to the first algorithm in the example, let’s assume for simplicity’s sake that the folders remain after the packets are handed out to a participant, such that the number of folders stays constant at the original number. There are N participants and N folders, creating a doubly nested loop. The computation is of order N² operations, or O(N²) in Big O asymptotic notation for the worst case. If the folders decreased each time, the computation would be (N + N - 1 + N - 2 . . .), still an O(N²) algorithm. The second algorithm can exploit the sorted order of the folders with a bisection search, so the algorithm can be done for the worst case in O(N log N) operations.
Asymptotic complexity tells us how the algorithm performs as we reach large sizes, such as one million participants. But we won’t ever have one million participants. We will have a finite size of 100 participants. For finite sizes, a more complete picture of the algorithmic performance is needed.
To illustrate this, we apply a performance model to one of the most basic computer algorithms to see how it might give us more insight. We’ll use a time-based model where we include the real hardware costs rather than an operation-based count. In this example, we look at a bisection search, also known as a binary search. It is one of the most common computer algorithms and numerical optimization techniques. Conventional asymptotic analysis says that the binary search is much faster than a linear search. We’ll show that accounting for how real computers function, the increase in speed is not as much as might be expected. This analysis also helps to explain the table lookup results in section 5.5.1.
Though asymptotic complexity is used to understand performance as an algorithm scales, it does not provide an equation for absolute performance. For a given problem, a linear search, which scales linearly, might outperform a bisection search, which scales logarithmically. This is especially true when you parallelize the algorithm, as it is much simpler to scale a linearly scaling algorithm than a logarithmically scaling one. In addition, the computer is designed to linearly walk through an array and prefetch data, which can speed up performance a little more. Finally, the specific problem can have the item occurring at the beginning of the array, where a bisection search performs much worse than a linear search.
As an example of other parallel considerations, let’s look at the implementation of the search on 32 threads of a multi-core CPU or a GPU. The set of threads must wait for the slowest to complete during each operation. The bisection search always takes 4 cache loads. The linear search varies in the number of cache lines required for each thread. The worst case controls how long the operation takes, making the cost closer to 16 cache lines than the average of 8 cache lines.
So you ask, how does this work in practice? Let’s look at the two variations of the table lookup code described in the example. You can test the following algorithms on your system with the perfect hash code included in the accompanying source code for this chapter at https://github.com/EssentialsofParallelComputing/Chapter5. First, the following listing shows the linear search algorithm version of the table lookup code.
Listing 5.1 Linear search algorithm in a table lookup
PerfectHash/table.c
268 double *interpolate_bruteforce(int isize, int xstride,
int d_axis_size, int t_axis_size, double *d_axis, double *t_axis,
269 double *dens_array, double *temp_array, double *data)
270 {
271 int i;
272
273 double *value_array=(double *)malloc(isize*sizeof(double));
274
275 for (i = 0; i<isize; i++){
276 int tt, dd;
277
278 for (tt=0; tt<t_axis_size-2 && ❶
temp_array[i] > t_axis[tt+1]; tt++); ❶
279 for (dd=0; dd<d_axis_size-2 && ❶
dens_array[i] > d_axis[dd+1]; dd++); ❶
280
281 double xf = (dens_array[i]-d_axis[dd])/ ❷
(d_axis[dd+1]-d_axis[dd]); ❷
282 double yf = (temp_array[i]-t_axis[tt])/ ❷
(t_axis[tt+1]-t_axis[tt]); ❷
283 value_array[i] =
xf * yf *data(dd+1,tt+1) ❷
284 + (1.0-xf)* yf *data(dd, tt+1) ❷
285 + xf *(1.0-yf)*data(dd+1,tt) ❷
286 + (1.0-xf)*(1.0-yf)*data(dd, tt); ❷
287
288 }
289
290 return(value_array);
291 }
❶ Specifies a linear search from 0 to axis_size
❷ Performs a bilinear interpolation between the four surrounding table values
The linear search of the two axes is done in lines 278 and 279. The coding is simple and straightforward, resulting in a cache-friendly implementation. Now let’s look at the bisection search in the following listing.
Listing 5.2 Bisection search algorithm in a table lookup
PerfectHash/table.c
293 double *interpolate_bisection(int isize, int xstride,
int d_axis_size, int t_axis_size, double *d_axis, double *t_axis,
294 double *dens_array, double *temp_array, double *data)
295 {
296 int i;
297
298 double *value_array=(double *)malloc(isize*sizeof(double));
299
300 for (i = 0; i<isize; i++){
301 int tt = bisection(t_axis, t_axis_size-2, ❶
temp_array[i]); ❶
302 int dd = bisection(d_axis, d_axis_size-2, ❶
dens_array[i]); ❶
303
304 double xfrac = (dens_array[i]-d_axis[dd])/ ❷
(d_axis[dd+1]-d_axis[dd]); ❷
305 double yfrac = (temp_array[i]-t_axis[tt])/ ❷
(t_axis[tt+1]-t_axis[tt]); ❷
306 value_array[i] =
xfrac * yfrac *data(dd+1,tt+1) ❷
307 + (1.0-xfrac)* yfrac *data(dd, tt+1) ❷
308 + xfrac *(1.0-yfrac)*data(dd+1,tt) ❷
309 + (1.0-xfrac)*(1.0-yfrac)*data(dd, tt); ❷
310 }
311
312 return(value_array);
313 }
314
315 int bisection(double *axis, int axis_size, double value)
316 {
317 int ibot = 0; ❸
318 int itop = axis_size+1; ❸
319
320 while (itop - ibot > 1){ ❸
321 int imid = (itop + ibot) /2; ❸
322 if ( value >= axis[imid] ) ❸
323 ibot = imid; ❸
324 else ❸
325 itop = imid; ❸
326 } ❸
327 return(ibot);
328 }
❶ Calls the bisection routine to find the interval on each axis
❷ Performs a bilinear interpolation between the four surrounding table values
❸ Bisection search: repeatedly halves the interval until it brackets the value
The bisection code is slightly longer than the linear search (listing 5.1), but it should have lower operational complexity. We’ll look at other table search algorithms in section 5.5.1 and show their relative performance in figure 5.8.
Spoiler: the bisection search is not as much faster than the linear search as you might expect, even accounting for the cost of the interpolation. That said, this analysis also shows that the linear search is not as slow as you might expect.
Now let’s take another example from everyday life to introduce some ideas for parallel algorithms. This first example demonstrates how an algorithmic approach that is comparison-free and less synchronous can be easier to implement and can perform better for highly parallel hardware. We discuss additional examples in the following sections that highlight spatial locality, reproducibility, and other important attributes for parallelism and then summarize all of the ideas in section 5.8 at the end of the chapter.
In this section, we will discuss the importance of a hash function. Hashing techniques originated in the 1950s and 60s, but have been slow to be adopted in many application areas. Specifically, we will go through what constitutes a perfect hash, spatial hashing, and perfect spatial hashing, along with some promising use cases.
A hash function maps from a key to a value, much like a dictionary uses a word as the lookup key to its definition. In figure 5.1, the word Romero is the key that is hashed to look up the value, which in this case is the moniker or username. Unlike a physical dictionary, a computer needs at least 26 possible storage locations times the maximum length of the dictionary key. So for a computer, it is absolutely necessary to encode the key into a shorter form called a hash. The term hash or hashing refers to “chopping up” the key into a shorter form to use as an index to store the value. The location for storing the collection of values for a specific key is called a bucket or bin. There are many different ways to generate a hash from a key; the best approaches are generally problem-specific.
Figure 5.1 Hash table to look up a computer moniker by last name. In ASCII, R is 82 and O is 79. We can then calculate the first hash key as 82 - 64 + 26 + 79 - 64 = 59. The value stored in the hash table is the username, sometimes called a moniker.
A perfect hash is one where there is at most one entry in each bucket. Perfect hashes are simple to handle, but can take more memory. A minimal perfect hash has exactly one entry in each bucket, with no empty buckets. It takes longer to calculate a minimal perfect hash, but for a fixed set such as a language's programming keywords, the extra time is worth it. For most of the hashes we’ll discuss here, the hashes will be created on the fly, queried, and thrown away, so a faster creation time is more important than memory size. Where a perfect hash is not feasible or takes too much memory, a compact hash can be employed. A compact hash compresses the hash so that it requires less storage memory. As always, there are tradeoffs in programming complexity, run time, and required memory among the different hashing methods.
The load factor is the fraction of the hash that is filled. It is computed by n/k, where n is the number of entries in the hash table and k is the number of buckets. Compact hashes still work at load factors of .8 to .9, but the efficiency drops off after that due to collisions. Collisions occur when more than one key wants to store its value in the same bucket. It is important to have a good hash function that distributes keys more uniformly, avoiding the clustering of entries, thereby allowing higher load factors. With a compact hash, both the key and the value are stored so that on retrieval, the key can be checked to see if it is the right entry.
In the previous examples, we used the first letter of the last name as a simple hash key. While effective, there are certainly flaws with using the first letter. One is that the number of last names starting with each letter of the alphabet is not evenly distributed, leading to unequal numbers of entries in each bucket. We could instead use the integer representation of the string, which produces a hash from the first four letters of the name. But the character set uses only 52 of the 256 possible values in each byte, so only a small fraction of the possible integer keys ever occurs. A special hash function that expects only characters would need far fewer storage locations.
Our discussion in chapter 1 used a uniform-sized, regular grid from the Krakatau example in figure 1.9. For this discussion on parallel algorithms and spatial hashing, we need to use more complex computational meshes. In scientific simulations, more complex meshes define with more detail the areas that we are interested in. In big data, specifically image analysis and categorization, these more complex meshes are not widely adopted. Yet the technique would have great value there; when a cell in the image has mixed characteristics, just split the cell.
The biggest impediment to using more complex meshes is that coding becomes more complicated and we must incorporate new computational techniques. For complex meshes, it is a greater challenge to find methods that work and scale well on parallel architectures. In this section, we’ll show you how you can handle some of the common spatial operations with highly parallel algorithms.
Cell-based adaptive mesh refinement (AMR) belongs to a class of unstructured mesh techniques that no longer have the simplicity of a structured grid to locate data. In cell-based AMR (figure 5.2), the cell data arrays are one-dimensional, and the data can be in any order. The mesh locations are carried along in additional arrays that have the size and location information for each cell. Thus, there is some structure to the grid, but the data is completely unstructured. Taking this further into unstructured territory, a fully-unstructured mesh could have cells of triangles, polyhedra, or other complex shapes. This allows the cells to “fit” the boundaries between land and ocean, but at the cost of more complex numerical operations. Because many of the same parallel algorithms for unstructured data apply to both, we’ll work mostly with the cell-based AMR example.
Figure 5.2 A cell-based AMR mesh for a wave simulation from the CLAMR mini-app. The black squares are the cells and the variously-shaded squares represent the height of a wave radiating outward from the upper right corner.
AMR techniques can be broken down into patch, block, and cell-based approaches. The patch and block methods use various size patches or fixed-size blocks that can at least partially exploit the regular structure of these groups of cells. Cell-based AMR has truly unstructured data that can be in any order. A shallow-water, cell-based AMR mini-app, CLAMR (https://github.com/lanl/CLAMR.git), was developed by Davis, Nicholaeff, and Trujillo while they were summer students in 2011 at Los Alamos National Laboratory. They wanted to see if cell-based AMR applications could run on GPUs. In the process, they found breakthrough parallel algorithms that also made CPU implementations run faster. The most important of these was a spatial hash.
Spatial hashing is a technique where the key is based on spatial information. The hashing algorithm retains the same average algorithmic complexity of Θ(1) operations for each lookup. All spatial queries can be performed with a spatial hash; many are much faster than alternative methods. The basic principle is to map objects onto a grid of buckets arranged in a regular pattern.
A spatial hash is shown in the center of figure 5.3. The sizing of the buckets is selected based on the characteristic size of the objects to map. For a cell-based AMR mesh, the minimum cell size is used. For particles or objects, as shown on the right in the figure, the cell size is based on the interaction distance. This choice means that only the cells immediately adjacent need to be queried for interaction or collision calculations. Collision calculations are one of the great application areas for spatial hashes, not only in scientific computing for smooth particle hydrodynamics, molecular dynamics, and astrophysics, but also in gaming engines and computer graphics. There are many situations where we can exploit spatial locality to reduce computational costs.
Figure 5.3 Computational meshes, particles, and objects mapped onto a spatial hash. The polyhedra of the unstructured mesh and the rectangular cells of the cell-based adaptive refinement mesh can be mapped to a spatial hash for spatial operations. Particles and geometric objects can also benefit from being mapped to a spatial hash to provide information about their spatial locality so that only nearby items need to be considered.
Both the AMR and the unstructured mesh on the left in the figure are referred to as differential discretized data because the cells are smaller where the gradients are steeper to better resolve the physical phenomena. But these have a limit to how much smaller the cells can get. The limit keeps the bucket sizes from getting too small. Both meshes store their cell indices in all the underlying buckets of the spatial hash. For the particles and geometric objects, the particle indices and object identifiers are stored in the buckets. This provides a form of locality that keeps the computational cost from increasing as the problem size increases. For example, if the problem domain is increased on the left and top, the interaction calculation in the lower right of the spatial hash stays the same. The algorithmic complexity thus stays Θ(N) for the particle calculations instead of growing to Θ(N²). The following listing shows the pseudo code for the interaction loop, where the inner loop is over nearby locations instead of having to search through all the particles.
Listing 5.3 Particle interaction pseudo-code
1 forall particles, ip, in NParticles{
2 forall particles, jp, in Adjacent_Buckets{
3 if (distance between particles < interaction_distance){
4 perform collision or interaction calculation
5 }
6 }
7 }
We’ll first look at perfect hashing to focus on the use of hashing rather than the mechanics internal to hashing. These methods all rely on being able to guarantee that there will be only one entry in each bucket, avoiding the issues of handling collisions where a bucket might have more than one data entry. For perfect hashing, we’ll investigate the four most important spatial operations:
Neighbor finding—Locating the one or two neighbors on each side of a cell
Remap—Transferring cell values from one mesh to another
Table lookup—Locating the intervals in the 2D table to perform the interpolation
Sorting—Ordering spatial data for faster access
All of the source code for the examples for the four operations in 1D and 2D is available at https://github.com/lanl/PerfectHash.git under an open source license. The source is also linked into the examples for this chapter. The perfect hash code uses CMake and tests for the availability of OpenCL. If you do not have OpenCL capability, the code detects that and will not compile the OpenCL implementations. The rest of the cases on the CPU will still run.
Neighbor finding using a spatial perfect hash
Neighbor finding is one of the most important spatial operations. In scientific computing, the material moved out of one cell has to move into the adjacent cell. We need to know which cell to move to in order to compute the amount of material and move it. In image analysis, the characteristics of the adjacent cell can give important information on the composition of the current cell.
The rules for the AMR mesh in CLAMR are that there can be only a single-level jump in refinement across a face of a cell. Also, the neighbor list of each cell on each side is just one of the neighbor cells, and the choice is to be the lower cell or the cell to the left of each pair as figure 5.4 shows. The second of the pair is found by using the neighbor list of the first cell; for example, ntop[nleft[ic]]. The problem then becomes setting up the neighbor arrays for every cell.
Figure 5.4 The left neighbor is the lower cell of the two to the left, and the bottom neighbor is the cell to the left of the two below. Similarly, the right neighbor is the lower cell of the two to the right, and the top neighbor is to the left of the two cells above.
What are the possible algorithms for finding neighbors? The naive way is to search all the other cells for the cell that is adjacent. This can be done by looking at the i, j, and level variables in each cell. The naive algorithm is O(N²). It performs well with small numbers of cells, but the run-time complexity grows large quickly. Some common alternative algorithms are tree-based, such as the k-D tree and quadtree algorithms (octree in three dimensions). These are comparison-based algorithms, described next, that scale as O(N log N). The code for the 2D neighbor calculation, including the k-D tree, brute force, CPU, and GPU hash implementations, is available at https://github.com/lanl/PerfectHash.git, along with the other spatial perfect hash applications discussed later in this chapter.
The k-D tree splits the mesh into two equal halves in the x-dimension and then two equal halves in the y-dimension, repeating until it finds the object. The algorithm to build the k-D tree is O(N log N), and each individual search is O(log N), so looking up the neighbors of all N cells is O(N log N).
The quadtree has four children for each parent, one for each quadrant. This exactly maps to the subdivision of the cell-based AMR mesh. A full quadtree starts from the top, or root, with one cell and subdivides to the finest level of the AMR mesh. A “truncated” quadtree starts from the coarsest level of the mesh and has a quadtree for each coarse cell to map down to the finest level. The quadtree algorithm is a comparison-based algorithm: O(N log N).
The limitation of just one level jump across a face is called a graded mesh. In cell-based AMR, graded meshes are common, but other quadtree applications such as n-body applications in astrophysics result in much larger jumps in the quadtree data structure. The one-level jump in refinement allows us to improve the algorithm design for finding neighbors. We can start our search at the leaf that represents our cell and, at most, we only have to go up two levels of the tree to find our neighbor. For searching for a near neighbor of similar size, the search should start at the leaves and use a quadtree. For searches for large irregular objects, the k-D tree should be used and the search should start from the root of the tree. Proper use of tree-based search algorithms can provide a viable implementation on CPUs, but the comparisons and tree construction present difficulties on GPUs, where comparisons beyond the work group cannot be done easily.
This sets the stage for the design of a spatial hash to perform the neighbor finding operation. We can guarantee that there are no collisions in our spatial hash by making the buckets in the hash the size of the finest cells in the AMR mesh. The algorithm then becomes
Allocate a spatial hash the size of the finest level of the cell-based AMR mesh
For each cell in the AMR mesh, write the cell number to the hash buckets underlying the cell
Compute the index for a finer cell one cell outside the current cell on each side
Read the neighbor's cell number from the hash bucket at that index
For the mesh shown in figure 5.5, the write phase is followed by a read phase to look up the index of the right neighbor cell.
Figure 5.5 Finding the right neighbor of cell 21 using a spatial perfect hash
This algorithm is well suited to GPUs and is shown in listing 5.5. The first implementation took less than a day to port from the CPU to the GPU. The original k-D tree would take weeks or months to implement on the GPU. The algorithmic complexity also breaks the O(log N) threshold and is, on average, Θ(N) for N cells.
This first implementation of the perfect hash neighbor calculation was an order of magnitude faster on the CPU than the k-D tree method, and an additional order of magnitude faster on the GPU than a single core of a CPU for a total of 3,157x speedup (figure 5.6). The algorithm performance study was done on an NVIDIA V100 GPU and a Skylake Gold 5118 CPU with a nominal clock frequency of 2.30 GHz. All the results in this chapter used this architecture as well. The CPU core and GPU architecture are the best available around 2018, giving a Best(2018) parallel speedup comparison (see section 1.6 for speedup notation). But it isn’t an architecture comparison between the CPU and the GPU. If the 24 virtual cores on this CPU were utilized, the CPU would also see a considerable parallel speedup.
Figure 5.6 The algorithm and parallel speedup total 3,157x. The new algorithm enables the parallel speedup on the GPU.
How hard is it to write the code for this kind of performance? Let’s take a look at the code for the hash table in listing 5.4 for the CPU. The inputs to the routine are the 1D arrays i, j, and level, where level is the refinement level, and i and j are the row and column of the cell in the mesh at that cell’s refinement level. The whole listing is about a dozen lines.
Listing 5.4 Writing out a spatial hash table for the CPU
neigh2d.c from PerfectHash
452 int *levtable = (int *)malloc((levmx+1)*sizeof(int)); ❶
453 for (int lev=0; lev<levmx+1; lev++) ❶
       levtable[lev] = (int)pow(2,lev); ❶
454
455 int jmaxsize = mesh_size*levtable[levmx]; ❷
456 int imaxsize = mesh_size*levtable[levmx]; ❷
457 int **hash = (int **)genmatrix(jmaxsize, ❸
       imaxsize, sizeof(int)); ❸
458
459 for(int ic=0; ic<ncells; ic++){ ❹
460    int lev = level[ic];
461    for (int jj=j[ic]*levtable[levmx-lev];
            jj<(j[ic]+1)*levtable[levmx-lev]; jj++) {
462       for (int ii=i[ic]*levtable[levmx-lev];
               ii<(i[ic]+1)*levtable[levmx-lev]; ii++) {
463          hash[jj][ii] = ic;
464       }
465    }
466 }
❶ Constructs a table of powers of two (1, 2, 4, ...)
❷ Sets the number of rows and columns at the finest level
❸ Allocates a 2D hash table sized to the finest level of the mesh
❹ Writes the cell index into every hash bucket underlying each cell
The loops at lines 459, 461, and 462 reference the 1D arrays i, j, and level; level is the refinement level where 0 is the coarse level and 1 to levmax are the levels of refinement. The arrays i and j are the row and column of the cell in the mesh at that cell’s refinement level.
Listing 5.5 shows the code for writing out the spatial hash in OpenCL for the GPU, which is similar to listing 5.4. Although we haven’t covered OpenCL yet, the simplicity of the GPU code is clear, even without understanding all the details. Let’s do a brief comparison to get a sense of how code has to change for the GPU. We define a macro to handle the 2D indexing and to make the code look more like the CPU version. Then the biggest difference is that there is no cell loop. This is typical of GPU code, where the outer loops are removed and are instead handled by the kernel launch. The cell index is provided for each thread by a call to the get_global_id intrinsic. There will be more on this example and writing OpenCL code, in general, in chapter 12.
Listing 5.5 Writing out a spatial hash table on the GPU in OpenCL
neigh2d_kern.cl from PerfectHash
77 #define hashval(j,i) hash[(j)*imaxsize+(i)]
78
79 __kernel void hash_setup_kern(
80    const uint isize,
81    const uint mesh_size,
82    const uint levmx,
83    __global const int *levtable, ❶
84    __global const int *i, ❶
85    __global const int *j, ❶
86    __global const int *level, ❶
87    __global int *hash
88 ) {
89
90    const uint ic = get_global_id(0); ❷
91    if (ic >= isize) return; ❸
92
93    int imaxsize = mesh_size*levtable[levmx];
94    int lev = level[ic];
95    int ii = i[ic];
96    int jj = j[ic];
97    int levdiff = levmx - lev;
98
99    int iimin = ii    *levtable[levdiff]; ❹
100   int iimax = (ii+1)*levtable[levdiff]; ❹
101   int jjmin = jj    *levtable[levdiff]; ❹
102   int jjmax = (jj+1)*levtable[levdiff]; ❹
103
104   for (int jjj = jjmin; jjj < jjmax; jjj++) {
105      for (int iii = iimin; iii < iimax; iii++) {
106         hashval(jjj, iii) = ic; ❺
107      }
108   }
109 }
neigh2d_kern.cl from PerfectHash 77 #define hashval(j,i) hash[(j)*imaxsize+(i)] 78 79 __kernel void hash_setup_kern( 80 const uint isize, 81 const uint mesh_size, 82 const uint levmx, 83 __global const int *levtable, ❶ 84 __global const int *i, ❶ 85 __global const int *j, ❶ 86 __global const int *level, ❶ 87 __global int *hash 88 ) { 89 90 const uint ic = get_global_id(0); ❷ 91 if (ic >= isize) return; ❸ 92 93 int imaxsize = mesh_size*levtable[levmx]; 94 int lev = level[ic]; 95 int ii = i[ic]; 96 int jj = j[ic]; 97 int levdiff = levmx - lev; 98 99 int iimin = ii *levtable[levdiff]; ❹ 100 int iimax = (ii+1)*levtable[levdiff]; ❹ 101 int jjmin = jj *levtable[levdiff]; ❹ 102 int jjmax = (jj+1)*levtable[levdiff]; ❹ 103 104 for ( int jjj = jjmin; jjj < jjmax; jjj++) { 105 for (int iii = iimin; iii < iimax; iii++) { 106 hashval(jjj, iii) = ic; ❺ 107 } 108 } 109 }
❶ Passes in the table of powers of 2 along with i, j, and level
❷ The loop across the cells is implied by the GPU kernel; each thread is a cell.
❸ The return is important to avoid reading past the end of the arrays.
❹ Computes the bounds of the underlying hash buckets to set
❺ Sets the hash table value to the thread ID (the cell number)
The code for retrieving the neighbor indexes is also simple, as shown in listing 5.6, with just a loop across the cells and a read of the hash table where the neighbor location would be on the finest level of the mesh. You can find the locations of the neighbors by incrementing the row or column by one cell in the direction needed. For the left or bottom neighbor, the increment is 1, while for the right or top neighbor, the increment is the full width of the mesh in the x-direction, imaxsize.
Listing 5.6 Finding neighbors from a spatial hash table on the CPU
neigh2d.c from PerfectHash
472 for (int ic=0; ic<ncells; ic++){
473 int ii = i[ic];
474 int jj = j[ic];
475 int lev = level[ic];
476 int levmult = levtable[levmx-lev];
477 int nlftval =
hash[ jj *levmult ] ❶
[MAX( ii *levmult-1,0 )]; ❶
478 int nrhtval =
hash[ jj *levmult ] ❶
[MIN((ii+1)*levmult, imaxsize-1)]; ❶
480 int nbotval =
hash[MAX( jj *levmult-1,0) ] ❶
[ ii *levmult ]; ❶
481 int ntopval =
hash[MIN((jj+1)*levmult, jmaxsize-1)] ❶
[ ii *levmult ]; ❶
482 neigh2d[ic].left = nlftval; ❷
483 neigh2d[ic].right = nrhtval; ❷
484 neigh2d[ic].bot = nbotval; ❷
485 neigh2d[ic].top = ntopval; ❷
486 }
❶ Calculates the neighbor cell location for the query, using a max/min to keep it in bounds
❷ Assigns the neighbor value for output arrays
For the GPU, we again remove the loop over the cells and replace it with a get_global_id call as shown in the following listing.
Listing 5.7 Finding neighbors from a spatial hash table on the GPU in OpenCL
neigh2d_kern.cl from PerfectHash
113 #define hashval(j,i) hash[(j)*imaxsize+(i)]
114
115 __kernel void calc_neighbor2d_kern(
116 const int isize,
117 const uint mesh_size,
118 const int levmx,
119 __global const int *levtable,
120 __global const int *i,
121 __global const int *j,
122 __global const int *level,
123 __global const int *hash,
124 __global struct neighbor2d *neigh2d
125 ) {
126
127 const uint ic = get_global_id(0); ❶
128 if (ic >= isize) return;
129
130 int imaxsize = mesh_size*levtable[levmx];
131 int jmaxsize = mesh_size*levtable[levmx];
132
133 int ii = i[ic]; ❷
134 int jj = j[ic];
135 int lev = level[ic];
136 int levmult = levtable[levmx-lev];
137
138 int nlftval = hashval( jj *levmult ,
max( ii *levmult-1,0 ));
139 int nrhtval = hashval( jj *levmult ,
min((ii+1)*levmult, imaxsize-1));
140 int nbotval = hashval(max( jj *levmult-1,0) ,
ii *levmult );
141 int ntopval = hashval(min((jj+1)*levmult, jmaxsize-1),
ii *levmult );
142 neigh2d[ic].left = nlftval;
143 neigh2d[ic].right = nrhtval;
144 neigh2d[ic].bottom = nbotval;
145 neigh2d[ic].top = ntopval;
146 }
❶ Gets the cell ID for the thread
❷ The rest of the code is similar to the CPU version.
Compare the simplicity of this code to the k-D tree code for the CPU, which is a thousand lines long!
Remap calculations using a spatial perfect hash
Another important numerical mesh operation is a remap from one mesh to another. Fast remaps can permit different physics to be performed on meshes optimized for their individual needs.
In this case, we will look at remapping the values from one cell-based AMR mesh to another cell-based AMR mesh. Mesh remaps can also involve unstructured meshes or particle-based simulations, but the techniques are more complicated. The setup phase is identical to the neighbor case, where the cell index for every cell is written to the spatial hash. In this case, the spatial hash is created for the source mesh. Then the read phase, shown in listing 5.8, queries the spatial hash for the cell numbers underlying each cell of the target mesh and sums up the values from the source mesh into the target mesh after adjusting for the size difference of the cells. For this demonstration, we have simplified the source code from the example at https://github.com/EssentialsofParallelComputing/Chapter5.git.
Listing 5.8 The read phase of the remapping of a value on the CPU
remap2.c from PerfectHash
211 for(int jc = 0; jc < ncells_target; jc++) {
212 int ii = mesh_target.i[jc]; ❶
213 int jj = mesh_target.j[jc]; ❶
214 int lev = mesh_target.level[jc]; ❶
215 int lev_mod = two_to_the(levmx - lev);
216 double value_sum = 0.0;
217 for(int jjj = jj*lev_mod; ❷
jjj < (jj+1)*lev_mod; jjj++) { ❷
218 for(int iii = ii*lev_mod; ❷
iii < (ii+1)*lev_mod; iii++) { ❷
219 int ic = hash_table[jjj*i_max+iii];
220 value_sum += value_source[ic] / ❸
(double)four_to_the( ❸
levmx-mesh_source.level[ic] ❸
); ❸
221 }
222 }
223 value_remap[jc] += value_sum;
224 }
❶ Gets the location of the target mesh cell
❷ Queries the spatial hash for source mesh cells
❸ Sums the values from the source mesh, adjusting for relative cell sizes
Figure 5.7 shows the performance improvement for the remap using the spatial perfect hash. There is a speedup due to the algorithm and then an additional parallel speedup from running on the GPU, for a total speedup of over 1,000 times. The parallel speedup on the GPU is made possible by the ease of implementing the algorithm on the GPU. Good parallel speedup should also be possible on a multicore processor.
Figure 5.7 The speedup of the remap algorithm due to the change of the algorithm from a k-D tree to a hash on a single core of the CPU and then ported to the GPU for a parallel speedup.
Table lookups using a spatial perfect hash
The operation of looking up values from tabular data presents a different kind of locality that can be exploited by a spatial hash. You can use hashing for searching for the intervals on both axes for the interpolation. For this example, we used a 51x23 lookup table of equation-of-state values. The two axes are density and temperature, with an equal spacing used between values on each axis. We will use n for the length of the axis and N for the number of table lookups that are to be performed. We used three algorithms in this study:
The first is a linear search (brute force) starting at the first column and row. The brute force should be an O(n) algorithm for each data query or for all N, O(N * n), where n is the number of columns or rows, respectively, for each axis.
The second is a bisection search that looks at the midpoint value of the possible range and recursively narrows the location for the interval. The bisection search should be an O(log n) algorithm for each data query.
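A bisection interval search can be sketched in a few lines of C. This is not the book's table code; the function name and the axis array below are illustrative only. It recursively halves the candidate range until only one interval remains.

```c
#include <assert.h>

/* Find the interval index k such that axis[k] <= x < axis[k+1],
 * by bisection. The candidate range [lo, hi] is halved each pass,
 * giving the O(log n) behavior described in the text. */
int bisection_interval(const double *axis, int n, double x)
{
    int lo = 0, hi = n - 1;     /* n axis points bound n-1 intervals */
    while (hi - lo > 1) {
        int mid = (lo + hi) / 2;
        if (x < axis[mid]) hi = mid;   /* interval is in the lower half */
        else               lo = mid;   /* interval is in the upper half */
    }
    return lo;
}
```

For a uniformly spaced axis, the hash approach in listing 5.9 replaces this loop with a single divide, which is the O(1) lookup measured in figure 5.8.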
Finally, we used a hash to do an O(1) lookup of the interval for each axis. We measured the performance of the hash on both a single core of a CPU and a GPU. The test code searches for the interval on both axes and then does a simple interpolation of the data values from the table to get the result.
Figure 5.8 shows the performance results for the different algorithms. The results hold some surprises. The bisection search is no faster than the brute force (linear search), despite being an O(N log n) algorithm instead of an O(N * n) algorithm. This seems to be contrary to the simple performance model, which indicates that the speedup should be 4-5x for the search on each axis. With the interpolation, we’d still expect around a 2x improvement. But there is a simple explanation, which you might guess from our discussions in section 5.2.
Figure 5.8 The algorithms used for table lookup show a large speedup for the hash algorithm on the GPU.
The search for the interval on each axis requires, at most, only two cache loads on one axis and four on the other for the linear search! The bisection would need the same number of cache loads. By considering cache loads, we would expect no difference in performance. The hash algorithm could directly go to the correct interval, but it would still need a cache load. The reduction in cache loads would be about a factor of 3x. The additional improvement is probably due to the reduction in the conditionals for the hash algorithm. The observed performance is in line with the expectations once we include the effect of the cache hierarchy.
Porting the algorithm to the GPU is a bit more involved and shows what performance enhancements are possible in the process. To understand what was done, let’s first look at the hash implementation on the CPU in listing 5.9. The code loops over all of the 16 million values, finding the intervals on each axis, and then interpolates the data in the table to get the resulting value. By using the hashing technique, we can find the interval locations by using a simple arithmetic expression with no conditionals.
Listing 5.9 The table interpolation code for the CPU
table.c from PerfectHash
272 double dens_incr =
(d_axis[50]-d_axis[0])/50.0; ❶
273 double temp_incr =
(t_axis[22]-t_axis[0])/22.0; ❶
274
275 for (int i = 0; i<isize; i++){
276 int tt = (temp[i]-t_axis[0])/temp_incr; ❷
277 int dd = (dens[i]-d_axis[0])/dens_incr; ❷
278
279 double xf = (dens[i]-d_axis[dd])/ ❷
280 (d_axis[dd+1]-d_axis[dd]); ❷
281 double yf = (temp[i]-t_axis[tt])/ ❷
282 (t_axis[tt+1]-t_axis[tt]); ❷
283 value_array[i] =
xf * yf *data(dd+1,tt+1) ❸
284 + (1.0-xf)* yf *data(dd, tt+1) ❸
285 + xf *(1.0-yf)*data(dd+1,tt) ❸
286 + (1.0-xf)*(1.0-yf)*data(dd, tt); ❸
287 }
❶ Computes a constant increment for each axis data lookup
❷ Determines the interval for interpolation and the fraction in the interval
❸ Bi-linear interpolation to fill the value_array with the results
We could simply port this to the GPU as was done in the earlier cases by removing the for loop and replacing it with a call to get_global_id. But the GPU has a small local memory cache that is shared by each work group, which can hold about 4,000 double-precision values. We have 1,173 values in the table and 51+23 axis values. These can fit in the local memory cache that can be accessed quickly and shared among all the threads in the workgroup. The code in listing 5.10 shows how this is done. The first part of the code cooperatively loads the data values into local memory using all of the threads. A synchronization is then required to guarantee that all the data is loaded before moving on to the interpolation kernel. The remaining code looks much the same as the code for the CPU in listing 5.9.
Listing 5.10 The table interpolation code in OpenCL for the GPU
table_kern.cl from PerfectHash
45 #define dataval(x,y) data[(x)+((y)*xstride)]
46
47 __kernel void interpolate_kernel(
48 const uint isize,
49 const uint xaxis_size,
50 const uint yaxis_size,
51 const uint dsize,
52 __global const double *xaxis_buffer,
53 __global const double *yaxis_buffer,
54 __global const double *data_buffer,
55 __local double *xaxis,
56 __local double *yaxis,
57 __local double *data,
58 __global const double *x_array,
59 __global const double *y_array,
60 __global double *value
61 )
62 {
63 const uint tid = get_local_id(0);
64 const uint wgs = get_local_size(0);
65 const uint gid = get_global_id(0);
66
67 if (tid < xaxis_size)
xaxis[tid]=xaxis_buffer[tid]; ❶
68 if (tid < yaxis_size)
yaxis[tid]=yaxis_buffer[tid]; ❶
69
70 for (uint wid = tid; wid<dsize; wid+=wgs){ ❷
71 data[wid] = data_buffer[wid]; ❷
72 } ❷
73
74 barrier(CLK_LOCAL_MEM_FENCE); ❸
75
76 double x_incr = (xaxis[50]-xaxis[0])/50.0; ❹
77 double y_incr = (yaxis[22]-yaxis[0])/22.0; ❹
78
79 int xstride = 51;
80
81 if (gid < isize) {
82 double xdata = x_array[gid]; ❺
83 double ydata = y_array[gid]; ❺
84
85 int is = (int)((xdata-xaxis[0])/x_incr); ❻
86 int js = (int)((ydata-yaxis[0])/y_incr); ❻
87 double xf = (xdata-xaxis[is])/ ❻
(xaxis[is+1]-xaxis[is]); ❻
88 double yf = (ydata-yaxis[js])/ ❻
(yaxis[js+1]-yaxis[js]); ❻
89
90 value[gid] =
xf * yf *dataval(is+1,js+1) ❼
91 + (1.0-xf)* yf *dataval(is, js+1) ❼
92 + xf *(1.0-yf)*dataval(is+1,js) ❼
93 + (1.0-xf)*(1.0-yf)*dataval(is, js); ❼
94 }
95 }
❶ Loads the axis values into local memory using the threads
❷ Cooperatively loads the table data into local memory
❸ Needs to synchronize before table queries
❹ Computes a constant increment for each axis data lookup
❺ Gets the x and y data values for this thread
❻ Determines the interval for interpolation and the fraction in the interval
❼ Bi-linear interpolation to fill the value array with the results
The performance result for the GPU hash code shows the impact of this optimization, with a larger speedup over the single-core CPU performance than the other kernels achieved.
Sorting mesh data using a spatial perfect hash
The sort operation is one of the most studied algorithms and forms the basis for many other operations. In this section, we look at the special case of sorting spatial data. You can use a spatial sort to find nearest neighbors, eliminate duplicates, simplify range finding, support graphics output, and perform a host of other operations.
For simplicity, we’ll work with 1D data with a minimum cell size of 2.0. All cell sizes must be a power-of-two multiple of the minimum cell size. The test case allows up to four levels of coarsening in addition to the minimum cell size, for the following possible sizes: 2.0, 4.0, 8.0, 16.0, and 32.0. Cell sizes are randomly generated, and the cells are randomly ordered. The sort is performed with a quicksort and then with a hash sort on the CPU and the GPU. The calculation for the spatial hash sort exploits the information about the 1D data. We know the minimum and maximum value of X and the minimum cell size. With this information, we can calculate a bucket index that guarantees a perfect hash with
B_i = (X_i - X_min)/Δ_min

where B_i is the bucket to place the entry, X_i is the x coordinate for the cell, X_min is the minimum value of X, and Δ_min is the minimum distance between any two adjacent values of X.
We can demonstrate the hash sort operation (figure 5.9). The minimum difference between values is 2.0, so a bucket size of 2 guarantees that there are no collisions. The minimum value is 0, so the bucket location can be calculated with B_i = X_i/Δ_min = X_i/2.0. We could store either the value or the index in the hash table. For example, 8, the first key, could be stored in bucket 4, or the original index location of 0 could be stored instead. If the value is stored, we retrieve the 8 with hash[4]. If the index is stored, then we retrieve the value with keys[hash[4]]. Storing the index location is a little slower in this case, but it is more general. It can also be used to reorder all the arrays in a mesh. In the test case for the performance study, we use the method of storing the index.
Figure 5.9 Sorting using a spatial perfect hash. This method stores the value in the hash with a bucket, but it could also store the index location of the value in the original array. Note that the bucket size of 2 with a range of 0 to 24 is indicated by the small numbers on the left of the hash table.
The spatial hash sort algorithm is Θ(N), while the quicksort is Θ(N log N). But the spatial hash sort is more specialized to the problem at hand and can temporarily take more memory. The remaining questions are how difficult this algorithm is to write and how well it performs. The following listing shows the code for the write phase of the spatial hash implementation.
Listing 5.11 The spatial hash sort on the CPU
sort.c from PerfectHash
283 uint hash_size =
(uint)((max_val - min_val)/min_diff); ❶
284 hash = (int*)malloc(hash_size*sizeof(int)); ❶
285 memset(hash, -1, hash_size*sizeof(int)); ❷
286
287 for(uint i = 0; i < length; i++) {
288 hash[(int)((arr[i]-min_val)/min_diff)] = i; ❸
289 }
290
291 int count=0; ❹
292 for(uint i = 0; i < hash_size; i++) { ❹
293 if(hash[i] >= 0) { ❹
294 sorted[count] = arr[hash[i]]; ❹
295 count++; ❹
296 } ❹
297 } ❹
298
299 free(hash);
❶ Creates a hash table with buckets of size min_diff
❷ Sets all the elements of hash array to -1
❸ Places the index of current array element into hash according to where the arr value goes
❹ Sweeps through hash and puts set values in a sorted array
Note that the code in the listing is barely more than a dozen lines. Compare this to a quicksort code that is five times as long and far more complicated.
Figure 5.10 shows the performance of the spatial sort on both a single core of the CPU and the GPU. As we shall see, the parallel implementation on the CPU and GPU takes some effort for good performance. The read phase of the algorithm needs a well-implemented prefix sum so that the retrieval of the sorted values can be done in parallel. The prefix sum is an important pattern for many algorithms; we’ll discuss it further in section 5.6.
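To see why a prefix sum matters here, a minimal serial sketch helps: each hash bucket contributes 1 if filled and 0 if empty, and an exclusive scan of those flags gives every filled bucket its destination index in the sorted output, so all the writes can then proceed independently. The function below is an illustrative serial version, not the parallel scan of section 5.6.

```c
#include <assert.h>

/* Serial exclusive prefix sum (scan): out[i] holds the sum of
 * in[0..i-1]. Applied to 0/1 "bucket filled" flags, out[i] is the
 * destination index in the sorted array for bucket i. */
void exclusive_prefix_sum(const int *in, int *out, int n)
{
    int running = 0;
    for (int i = 0; i < n; i++) {
        out[i] = running;    /* sum of everything before position i */
        running += in[i];
    }
}
```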
The GPU implementation for this example uses a well-implemented prefix sum, and the performance of the spatial hash sort is excellent as a result. In earlier tests with an array size of two million, this GPU sort was 3x faster than the fastest general GPU sort, and the serial CPU version was 4x faster than the standard quicksort. With current CPU architectures and a larger array size of 16 million, our spatial hash sort is shown to be nearly 6x faster (figure 5.10). It is remarkable that our sort, written in two or three months, is much faster than the current fastest reference sorts on the CPU and GPU, especially since the reference sorts are the result of decades of research and the effort of many researchers!
Figure 5.10 Our spatial hash sort shows a speedup on a single core of the CPU and a further parallel speedup on the GPU. Our sort is 6x faster than the current fastest sort.
We are not done yet with exploring hashing methods. The algorithms in the perfect hashing section can be greatly improved. In the next sections, we explore using compact hashes for the neighbor finding and remap operations. The key observations are that we don’t need to write to every spatial hash bin, and we can improve the algorithms by handling collisions. This allows the spatial hashes to be compressed to use less memory, giving us more options in algorithm choice with different memory requirements and run times.
Neighbor finding with write optimizations and compact hashing
The previous simple perfect hash algorithm for finding neighbors performs well for small numbers of mesh refinement levels in an AMR mesh. But when there are six or more levels of refinement, a coarse cell writes to 64 hash buckets, and a fine cell only has to write to one, leading to a load imbalance and a problem with thread divergence for parallel implementations.
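The imbalance can be quantified: a cell that is levdiff levels coarser than the finest level spans 2^levdiff hash buckets in each direction, so it writes (2^levdiff)^2 = 4^levdiff buckets in 2D. A cell three levels coarser thus writes 4^3 = 64 buckets, while a finest-level cell writes only 1. A small sketch (the function name is illustrative):

```c
#include <assert.h>

/* Hash buckets written by a 2D cell that is levdiff refinement
 * levels coarser than the finest level: 2^levdiff per direction,
 * squared. Computed with a shift: 4^levdiff == 1 << (2*levdiff). */
int writes_per_cell(int levdiff)
{
    return 1 << (2 * levdiff);
}
```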
Thread divergence is when the amount of work for each thread varies and the threads end up waiting for the slowest. We can improve the perfect hash algorithm further with the optimizations shown in figure 5.11. The first optimization is realizing that the neighbor queries only sample the outer hash buckets of a cell, so there is no need to write to the interior. Further analysis shows that only the corners or midpoints of the cell’s representation in the hash will be queried, reducing the needed writes even further. In the figure, the example shown to the far right of the sequence further optimizes the writes to only one per cell and does multiple reads where the entry exists for a finer, same size, or coarser neighbor cell. This last technique requires initializing the hash table to a sentinel value such as -1 to indicate no entry.
Figure 5.11 Optimizing the neighbor-finding calculation using the perfect spatial hash by reducing the number of writes and reads
But now that less data is written to the hash, we have a lot of empty space, or sparsity, and can compress the hash table to as low as 1.25x the number of entries, greatly reducing the memory requirements of the algorithm. The inverse of the size multiplier is known as the hash load factor and is defined as the number of filled hash table entries divided by the hash table size. For a 1.25 size multiplier, the hash load factor is 0.8. We typically use a much smaller load factor, around 0.333, or a size multiplier of 3. This is because in parallel processing, we want to avoid one processor being slower than the others. Hash sparsity represents the empty space in the hash; sparsity indicates the opportunity for compression.
Figure 5.12 shows the process of creating a compact hash. Because of the compression to a compact hash, two entries try to store their value in bucket 1. The second entry sees that there is already a value there, so it looks for the next open slot in a technique called open addressing. In open addressing, we look for the next open slot in the hash table and store the value in that slot. There are other hashing methods besides open addressing, but these often require the ability to allocate memory during an operation. Allocating memory is more difficult on the GPU, so we stick with open addressing, where collisions are resolved by finding alternate storage locations within the already-allocated hash table.
Figure 5.12 This sequence from left to right shows the storing of spatial data in a perfect spatial hash, compressing it into a smaller hash and then, where there is a collision, looking for the next available empty slot to store it.
In open addressing, there are a few choices that we can use as the trial for the next open slot. These are
Linear probing—Where the next entry is just the next bucket in sequence until an open bucket is found
Quadratic probing—Where the increment is squared so that the attempted buckets are +1, +4, +9, and so forth from the original location
Double hashing—Where a second hashing function is used to jump to a deterministic, but pseudo-random distance from the first trial location
The reason for the more complex choices for the next trial is to avoid clustering of values in part of the hash table, leading to longer store and query sequences. We use the quadratic probing method because the first couple of tries are in the cache, which leads to better performance. Once a slot is found, both the key and the value are stored. When reading the hash table, the stored key is compared to the read key, and if they aren’t the same, then the read tries the next slot in the table.
We could make a performance estimate of the improvement of these optimizations by counting the number of writes and reads. But we need to adjust these write and read numbers to account for the number of cache lines and not just the raw number of values. Also, the code with the optimizations has more conditionals. Thus, the run-time improvement is modest and only better for higher levels of mesh refinement. The parallel code on the GPU shows more benefit because the thread divergence is reduced.
Figure 5.13 shows the measured performance results for the different hash table optimizations for a sample AMR mesh that has a relatively modest sparsity factor of 30. The code is available at https://github.com/lanl/CompactHash.git. The last performance numbers shown in figure 5.13 for both the CPU and GPU are for compact hash runs. The cost of the compact hash is offset by not having as much memory to initialize to the sentinel value of -1. The effect is that the compact hash has a competitive performance compared to the perfect hashing methods. With more sparsity in the hash table than the 30x compression factor in this test case, the compact hash can even be faster than the perfect hash methods. Cell-based AMR methods in general should have at least a 10x compression and can often exceed 100x.
Figure 5.13 The optimized versions shown for the CPU and GPU correspond to the methods shown in figure 5.11. Compact is the CPU compact, and G Comp is the GPU compact for the last method in each set. The compact method is faster than the original perfect hash, requiring considerably less memory. At higher levels of refinement, the methods that reduce the number of writes show some performance benefit as well.
These hashing methods have been implemented in the CLAMR mini-app. The code switches between a perfect hash algorithm for low levels of sparsity and the compact hash when there is a lot of empty space in the hash.
Face neighbor finding for unstructured meshes
So far, we haven’t discussed algorithms for unstructured meshes because it’s hard to guarantee that a perfect hash can easily be created for these. The most practical methods require a way to handle collisions and, thus, compact hashing techniques. Let’s explore one case where the use of a hash is fairly straightforward. Finding the neighbor face for a polygonal mesh can be an expensive search procedure. Many unstructured codes store the neighbor map because it is so expensive. The technique we show next is so fast that the neighbor map can be calculated on the fly.
The proper size of the hash table is difficult to specify. The best solution is to pick a reasonable size based on the number of faces or the minimum face length and then handle collisions if these occur.
Remaps with write optimizations and compact hashing
Another operation, the remap, is a little more difficult to optimize and set up for a compact hash because the perfect hash approach reads all the underlying cells. First, we have to come up with a way that doesn’t require every hash bucket to be filled.
Figure 5.14 A single write, multiple read implementation of a spatial hash remap. The first query is where a cell of the same size from the input mesh would write, and then if no value is found, the next query looks for where a cell at the next coarser level would write.
We write the cell indices for each cell to the lower left corner of the underlying hash. Then, during the read, if a value is not found or the level of the cell in the input mesh is not correct, we look for where a cell in the input mesh would write if it were at the next coarser level. Figure 5.14 shows this approach, where cell 1 in the output mesh queries the hash location (0,2) and finds a -1, so it then looks for where the next coarser cell would be at (0,0) and finds the cell index of 1. The density of cell 1 in the output mesh is then set to the density of cell 1 in the input mesh. For cell 9 in the output mesh, it looks in the hash at (4,4) and finds an input cell index of 3. It then looks up the level of cell 3 in the input mesh, and because the input mesh cell level is finer, it must also query hash location (6,4), which returns cell index 9; location (4,6), which returns cell index 4; and location (6,6), which returns cell index 7. The first two cell indices are at the same level, so these do not need to go any further. The cell index of 7 is at a finer level, so we must recursively descend into that location to find cell indices of 8, 5, and 6. Listing 5.12 shows the code.
Listing 5.12 The setup phase for the single-write spatial hash remap on the CPU
singlewrite_remap.cc and meshgen.cc from CompactHashRemap/AMR_remap

#define two_to_the(ishift) (1u << (ishift))     ❶

typedef struct {                                ❷
   uint ncells;                                 ❸
   uint ibasesize;                              ❹
   uint levmax;                                 ❺
   uint *dist;                                  ❻
   uint *i;
   uint *j;
   uint *level;
   double *values;
} cell_list;

cell_list icells, ocells;                       ❼
<... lots of code to create mesh ...>           ❼

size_t hash_size = icells.ibasesize*two_to_the(icells.levmax)*
                   icells.ibasesize*two_to_the(icells.levmax);
int *hash = (int *) malloc(hash_size * sizeof(int));   ❽
uint i_max = icells.ibasesize*two_to_the(icells.levmax);
❶ Defines 2n power function as the shift operator for speed
❷ Structure to hold characteristics of a mesh
❸ Number of cells in the mesh
❹ Number of coarse cells across the x-dimension
❺ Number of refinement levels in addition to the base mesh
❻ Distribution of cells across levels of refinement
❼ Sets up input and output meshes
❽ Allocates the hash table for a perfect hash
Before the write, a perfect hash table is allocated and initialized to the sentinel value of -1 (figure 5.10). Then the cell indices from the input mesh are written to the hash (listing 5.13). The code is available at https://github.com/lanl/CompactHashRemap.git in the file AMR_remap/singlewrite_remap.cc, along with variants for using a compact hash table and OpenMP. The OpenCL version for the GPU is in AMR_remap/h_remap_kern.cl.
Listing 5.13 The write phase for the single-write spatial hash remap on the CPU
AMR_remap/singlewrite_remap.cc from CompactHashRemap

for (uint i = 0; i < icells.ncells; i++) {                                 ❶
   uint lev_mod = two_to_the(icells.levmax - icells.level[i]);             ❷
   hash[((icells.j[i] * lev_mod) * i_max) + (icells.i[i] * lev_mod)] = i;  ❸
}
❶ The actual write to the hash is just four lines.
❷ The multiplier to convert between mesh levels
❸ Computes the index for the 1D hash table
The code for the read phase (listing 5.14) has an interesting structure. The first part splits into two cases: either the cell at the same location in the input mesh is at the same level or coarser, or it is covered by a set of finer cells. In the first case, we loop up the levels until we find the right level and set the value in the output mesh to the value in the input mesh. If the input cells are finer, we recurse down the levels, summing up the values as we go.
Listing 5.14 The read phase for the single-write spatial hash remap on the CPU
AMR_remap/singlewrite_remap.cc from CompactHashRemap

for (uint i = 0; i < ocells.ncells; i++) {
   uint io = ocells.i[i];
   uint jo = ocells.j[i];
   uint lev = ocells.level[i];

   uint lev_mod = two_to_the(ocells.levmax - lev);
   uint ii = io*lev_mod;
   uint ji = jo*lev_mod;

   uint key = ji*i_max + ii;
   int probe = hash[key];

   if (lev > ocells.levmax){lev = ocells.levmax;}

   while(probe < 0 && lev > 0) {                    ❶
      lev--;
      uint lev_diff = ocells.levmax - lev;
      ii >>= lev_diff;
      ii <<= lev_diff;
      ji >>= lev_diff;
      ji <<= lev_diff;
      key = ji*i_max + ii;
      probe = hash[key];
   }
   if (lev >= icells.level[probe]) {
      ocells.values[i] = icells.values[probe];      ❷
   } else {
      ocells.values[i] =
         avg_sub_cells(icells, ji, ii, lev, hash);  ❸
   }
}

double avg_sub_cells (cell_list icells, uint ji, uint ii,
                      uint level, int *hash) {
   uint key, i_max, jump;
   double sum = 0.0;
   i_max = icells.ibasesize*two_to_the(icells.levmax);
   jump = two_to_the(icells.levmax - level - 1);

   for (int j = 0; j < 2; j++) {
      for (int i = 0; i < 2; i++) {
         key = ((ji + (j*jump)) * i_max) + (ii + (i*jump));
         int ic = hash[key];
         if (icells.level[ic] == (level + 1)) {
            sum += icells.values[ic];               ❹
         } else {
            sum += avg_sub_cells(icells, ji + (j*jump),
                                 ii + (i*jump), level+1, hash);  ❺
         }
      }
   }

   return sum/4.0;
}
❶ If a sentinel value is found, continues to coarser levels
❷ Because this is at the same level or coarser, sets the value of the found cell ID in the input mesh
❸ For finer cells, recursively descends and sums the contributors
❹ Accumulates to the new value
❺ Recursively descends another level for still finer cells
OK, this seems fine for the CPU, but how is it going to work on the GPU? Supposedly, recursion is not supported on the GPU, and there doesn't seem to be any easy way to write this without recursion. But we tested it on the GPU and found that it works. It runs fine on all of the GPUs that we tried for the limited number of levels of refinement that would be used in any practical mesh. Evidently, a limited amount of recursion works on a GPU! We then implemented compact hash versions of this approach, and these show good performance.
Hierarchical hash technique for the remap operation
Another innovative approach to using hashing for a remap operation involves a hierarchical set of hashes and a “breadcrumb” technique. A breadcrumb trail of sentinel values has the benefit that we do not need to initialize the hash tables to a sentinel value at the start (figure 5.15).
Figure 5.15 A hierarchical hash table with a separate hash for each level. When a write is done in one of the finer levels, a sentinel value is placed in each level above to form a “breadcrumb” trail to inform queries that there is data at finer levels.
The first step is to allocate a hash table for each level of the mesh. Then the cell indices are written to the appropriate level hash and recurse upward through the coarser hashes, leaving a sentinel value so that queries know there are values in the finer-level hash tables. Looking at figure 5.15 for cell 9 in the input mesh, we see that
The cell index is written to the mid-level hash table, then a sentinel value is written to the hash bins in the coarser hash table.
The read operation for cell 9 first goes to the coarsest level of the hash table, where it finds a sentinel value of -1. It now knows that it must go to the finer levels.
It finds three cells at the mid-level hash table and another sentinel value to tell the read operation to recursively descend to the finest level, where it finds four more values to add to the summation.
The other queries are all found in the coarsest hash table, and the values are assigned to the output mesh.
Each of the hash tables can be either a perfect hash or a compact hash. The method has a recursive structure, similar to the previous technique. It also runs fine on GPUs.
The prefix sum was a critical element of making the hash sort work in parallel in section 5.5.1. The prefix sum operation, also known as a scan, is a common operation in computations with irregular sizes. Many computations with irregular sizes need to know where to start writing to be able to operate in parallel. A simple example is where each processor has a different number of particles. To be able to write to the output array or access data on other processors or threads, each processor needs to know the relationship of the local indices to the global indices. In the prefix sum, the output array, y, is a running sum of all of the numbers previous to it in the original array:

y[i] = x[0] + x[1] + ... + x[i-1]
The prefix sum can either be an inclusive scan, where the current value is included, or an exclusive scan, where it isn’t included. The previous equation is for an exclusive scan. Figure 5.16 shows both an exclusive and an inclusive scan. The exclusive scan is the starting index for the global array, while the inclusive scan is the ending index for each process or thread.
Figure 5.16 The array x gives the number of particles in each cell. The exclusive and inclusive scan of an array gives the starting and ending address in the global data set.
The following listing shows the standard serial code for the scan operation.
Listing 5.15 The serial inclusive scan operation
y[0] = x[0];
for (int i=1; i<n; i++){
   y[i] = y[i-1] + x[i];
}
Once the scan operation is complete, each process is free to perform its operation in parallel because the process knows where to put its result. The scan operation itself, though, appears to be intrinsically serial. Each iteration is dependent on the previous. But there are effective ways to parallelize it. We’ll look at a step-efficient, a work-efficient, and a large array algorithm in this section.
A step-efficient algorithm uses the fewest number of steps. But this might not be the fewest number of operations because a different number of operations is possible with each step. This was discussed earlier when defining computational complexity in section 5.1.
The prefix sum operation can be made parallel with a tree-based reduction pattern as figure 5.17 shows. Rather than waiting for the previous element to sum up its values, each element sums its value and the preceding value. Then it does the same operation, but with the value two elements over, four elements over, and so on. The end result is an inclusive scan; during the operation all of the processes have been busy.
Figure 5.17 The step-efficient inclusive scan uses O(log2n) steps to compute a prefix sum in parallel.
Now we have a parallel prefix that operates in just log2n steps, but the amount of work increases from the serial algorithm. Can we design a parallel algorithm that has the same amount of work?
A work-efficient algorithm uses the least number of operations. This might not be the fewest number of steps because a different number of operations is possible with each step. The choice of a work-efficient or a step-efficient algorithm is dependent on the number of parallel processes that can exist.
The work-efficient parallel scan operation uses two sweeps through the arrays. The first sweep is called an upsweep, though it is more of a right sweep. It is shown in figure 5.18 from top to bottom, rather than the traditional bottom to top for easier comparison to the step-efficient algorithm.
Figure 5.18 The upsweep phase of the work-efficient scan shown from top to bottom, which has far fewer operations than the step-efficient scan. Essentially, every other value is left unmodified.
The second phase, known as the downsweep phase, is more of a left sweep. It starts by setting the last value to zero and then does another tree-based sweep (figure 5.19) to get the final result. The amount of work is reduced significantly, but with the requirement of more steps.
Figure 5.19 The downsweep phase of the work-efficient exclusive scan operation has far fewer operations than the step-efficient scan.
When shown this way, the work-efficient scan has an interesting pattern, with a right sweep starting with half the threads and decreasing until only one is operating. Then it begins a sweep back to the left with one thread at the start and finishing with all threads busy. The additional steps allow the earlier calculations to be reused so that the total operations are only O(N ).
These two parallel prefix sum algorithms give us a couple of different options on how to incorporate parallelism in this essential operation. But both of these are limited to the number of threads available in a workgroup on the GPU or the number of processors on a CPU.
For larger arrays, we also need an algorithm that is parallel. Figure 5.20 shows such an algorithm using three kernels for the GPU. The first kernel starts with a reduction sum on each workgroup and stores the result in a temporary array that is smaller than the original large array by a factor of the number of threads in the workgroup. On the GPU, the number of threads in a workgroup is typically as high as 1,024. The second kernel then loops across the temporary array, performing a scan on each workgroup-sized block, so that the temporary array now holds the offsets for each workgroup. A third kernel is then invoked to perform the scan operation on workgroup-sized chunks of the original array, applying the offset for each thread at this level.
Figure 5.20 The large array scan proceeds in three stages and as three kernels for the GPU. The first stage does a reduction sum to an intermediate array. The second stage does a scan to create the offsets for the work groups. Then the third phase scans the original array and applies the work group offsets to get the scan results for each element of the array.
Because the parallel prefix sum is so important in operations like sorts, it is heavily optimized for GPU architectures. We don’t go into that level of detail in this book. Instead, we suggest that application developers use libraries or freely available implementations for their work. For the parallel prefix scan available for CUDA, you’ll find implementations such as the CUDA Data Parallel Primitives Library (CUDPP), available at https://github.com/cudpp/cudpp. For OpenCL, we suggest either the implementation from its parallel primitives library, CLPP, or the scan implementation from our hash-sorting code available in the sort_kern.cl file at https://github.com/LANL/PerfectHash.git. We’ll present a version of the prefix scan for OpenMP in chapter 7.
Not all parallel algorithms are about speeding up calculations. The global sum is a prime example of such a case. Parallel computing has been plagued since the earliest days with the non-reproducibility of sums across processors. In this section, we show one example of an algorithm that improves the reproducibility of a parallel calculation so that it gets nearer to the results of the original serial calculation.
Changing the order of additions changes the answer in finite-precision arithmetic. This is problematic because a parallel calculation changes the order of the additions. The problem is due to finite-precision arithmetic not being associative. And the problem gets worse as the problem size gets larger because the addition of the last value becomes a smaller and smaller part of the overall sum. Eventually the addition of the last value might not change the sum at all. There is even a worse case for additions of finite-precision values when adding two values that are almost identical, but of different signs. This subtraction of one value from another when these are nearly the same causes a catastrophic cancellation. The result is only a few significant digits with noise filling the rest.
The result in the example has only a couple of significant digits left! And where do the rest of the digits in the printed value come from? The problem in parallel computing is that instead of the sum being a linear addition of the values in the array, on two processors the sum is a linear sum of half of the array and then the two partial sums added at the end. The change in the order causes the global sum to be different. The difference can be small, but now the question is whether the parallelization of the code has been done properly. Exacerbating the problem is that all of the new parallelization techniques and hardware, such as vectorization and threads, also cause this problem. The pattern for this global sum operation is called a reduction.
Definition A reduction is an operation where an array of one or more dimensions is reduced to at least one dimension less and often to a scalar value.
This operation is one of the most common in parallel computing and is often a concern for performance, and in this case, correctness. An example of this is calculating the total mass or energy in a problem. This takes a global array of the mass in each cell and results in a single scalar value.
As with all computer calculations, the results of the global sum reduction are not exact. In serial calculations, this does not pose a serious problem because we always get the same inexact result. In parallel, we most likely get a more accurate result with more correct significant digits, but it is different than the serial result. This is known as the global sum issue. Anytime the results between the serial and parallel versions were slightly different, the cause was attributed to this problem. But often, when time was taken to dig more deeply into the code, the problem turned out to be a subtle parallel programming error such as failing to update the ghost cells between processors. Ghost cells are cells that hold the adjacent processor values needed by the local processor, and if they are not updated, the slightly older values cause a small error compared to the serial run.
For years, I thought, like other parallel programmers, that the only solution was to sort the data into a fixed order and sum it up in a serial operation. But because this was too expensive, we just lived with the problem. In about 2010, several parallel programmers, including myself, realized that we were looking at the problem wrong. It is not solely an order problem, but also a precision problem. In real number arithmetic, addition is associative! So adding precision is also a way to solve the problem and at a far lower cost than sorting the data.
To gain a better understanding of the problem and how to solve it, let’s take a look at a problem from compressible fluid dynamics called the Leblanc problem, also known as “the shock tube from hell.” In the Leblanc problem, a high pressure region is separated from a low pressure region by a diaphragm that is removed at time zero. It is a challenging problem because of the strong shock that results. But the feature we are most interested in is the large dynamic range in both the density and energy variables. We’ll use the energy variable with a high value at 1.0e-1 and a low value of 1.0e-10. The dynamic range is the range of the working set of real numbers, or in this case, the ratio of the maximum and the minimum values. The dynamic range is nine orders of magnitude, which means that when adding the small value to the large value for double-precision, floating-point numbers with about 16 significant digits, in reality, we only have about 7 significant digits in the result.
Let’s look at a problem size of 134,217,728 on a single processor with half the values at the high energy state and the other half at the low energy state. These two regions are separated by a diaphragm at the beginning of the problem. The problem size is large for a single processor, but for a parallel computation, it is relatively small. If the high energy values are summed first, the next single low value that is added will have few significant digits to contribute. Reversing the order of the sum so that the low energy values are summed first makes the small values of near equal size in their sum, and by the time the high energy value is added, there will be more significant digits, thus a more accurate sum. This gives us a possible sorting-based solution. Just sort the values in order from the lowest magnitude to the highest and you will get a more accurate sum. There are several solutions for addressing the global sum that are much more tractable than the sorting technique. The list of possible techniques presented here includes
You can try the various methods in the exercises that accompany the chapter at https://github.com/EssentialsOfParallelComputing/Chapter5.git. The original study looked at parallel OpenMP implementations and truncation techniques that we won’t go into here.
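The effect of summation order can be seen with a stand-alone sketch (our own illustration with a made-up, smaller data set: one high energy value of 1.0e-1 followed by ten million low energy values of 1.0e-10, so the exact total is 1.0e-1 + 1.0e-3):

```c
#include <assert.h>
#include <math.h>

/* The large value enters the accumulator first; every small add then
   rounds against the large magnitude and loses digits */
double sum_large_first(double large, double small, long n)
{
   double sum = large;
   for (long i = 0; i < n; i++) sum += small;
   return sum;
}

/* The small values are summed among near-equal magnitudes first,
   then the large value is added once at the end */
double sum_small_first(double large, double small, long n)
{
   double sum = 0.0;
   for (long i = 0; i < n; i++) sum += small;
   return sum + large;
}
```

Summing the small values first reproduces the reference total far more closely than summing the large value first, which is exactly the behavior the sorted-order argument predicts.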
The easiest solution is to use the long-double data type on an x86 architecture. On this architecture, a long double is implemented in hardware as an 80-bit floating-point number, giving an extra 16 bits of precision. Unfortunately, this is not a portable technique. On some architectures and compilers, a long double is only 64 bits; on others it is 128 bits and implemented in software. Some compilers also force rounding between operations to maintain consistency with other architectures. Check your compiler documentation carefully for how it implements long double before using this technique. The code shown in the next listing is simply a regular sum with the data type of the accumulator set to long double.
Listing 5.16 Long-double data type sum on x86 architectures
GlobalSums/do_ldsum.c
1 double do_ldsum(double *var, long ncells)
2 {
3 long double ldsum = 0.0;
4 for (long i = 0; i < ncells; i++){
5 ldsum += (long double)var[i]; ❶
6 }
7 double dsum = ldsum; ❷
8 return(dsum); ❸
9 }
❶ var is an array of doubles, while the accumulator is a long double.
❷ The return type of the function can also be long double and the value of ldsum returned.
At line 8 in the listing, a double is returned to stay consistent with the concept of a higher-precision accumulator returning the same data type as the array. We’ll see how this performs later, but first let’s cover the other methods for addressing the global sum.
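As a quick sanity check of the long-double technique (our own sketch, not from the book’s repository), the two accumulator types can be compared on a hostile input: one value of 1.0e-1 followed by ten million values of 1.0e-10. On x86 hardware the 80-bit accumulator absorbs the small values with far less rounding loss; on platforms where long double is no wider than double, the two results simply coincide:

```c
#include <assert.h>
#include <math.h>

/* Plain double accumulator */
double sum_with_double(double first, double small, long n)
{
   double sum = first;
   for (long i = 0; i < n; i++) sum += small;
   return sum;
}

/* Wider long double accumulator, as in the listing above */
double sum_with_long_double(double first, double small, long n)
{
   long double sum = first;
   for (long i = 0; i < n; i++) sum += (long double)small;
   return (double)sum;  /* convert back for a consistent return type */
}
```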
The pairwise summation is a surprisingly simple solution to the global sum problem, especially within a single processor. The code is relatively straightforward, as the following listing shows, but it requires an additional array half the size of the original.
Listing 5.17 Pairwise summation on a processor
GlobalSums/do_pair_sum.c
4 double do_pair_sum(double *var, long ncells)
5 {
6 double *pwsum =
(double *)malloc(ncells/2*sizeof(double)); ❶
7
8 long nmax = ncells/2;
9 for (long i = 0; i<nmax; i++){ ❷
10 pwsum[i] = var[i*2]+var[i*2+1]; ❷
11 } ❷
12
13 for (long j = 1; j<log2(ncells); j++){ ❸
14 nmax /= 2; ❸
15 for (long i = 0; i<nmax; i++){ ❸
16 pwsum[i] = pwsum[i*2]+pwsum[i*2+1]; ❸
17 } ❸
18 } ❸
19 double dsum = pwsum[0]; ❹
20 free(pwsum); ❺
21 return(dsum);
22 }
❶ Needs temporary space to do the pairwise recursive sums
❷ Adds the initial pairwise sum into new array
❸ Recursively sums the remaining log2 steps, halving the array size at each step
❹ Assigns the result to a scalar value for return
❺ Frees the temporary array
The simplicity of the pairwise summation becomes a little more complicated when working across processors. If the algorithm remains true to its basic structure, a communication may be needed at each step of the recursive sum.
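One caveat: listing 5.17 implicitly assumes that ncells is a power of two; for other sizes, the integer halving in its loops silently drops trailing elements. A sketch of a generalized variant (our own, not from the book’s repository) carries any unpaired element forward to the next round:

```c
#include <assert.h>
#include <stdlib.h>

/* Pairwise sum for any ncells >= 1; odd leftover elements are carried
   forward to the next round rather than dropped */
double do_pair_sum_any(const double *var, long ncells)
{
   if (ncells == 1) return var[0];
   double *work = (double *)malloc(((ncells + 1) / 2) * sizeof(double));
   const double *src = var;
   long m = ncells;
   while (m > 1) {
      long half = m / 2;
      for (long i = 0; i < half; i++)
         work[i] = src[2*i] + src[2*i + 1];  /* pairwise adds; in place after round 1 */
      if (m % 2 == 1)
         work[half++] = src[m - 1];          /* carry the unpaired element */
      src = work;
      m = half;
   }
   double dsum = work[0];
   free(work);
   return dsum;
}
```

For power-of-two sizes this performs the same recursive halving as the listing.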
Next is the Kahan summation. The Kahan summation is the most practical of the possible global sum methods. It uses one additional double variable to carry the remainder of each operation, in effect doubling the precision of the accumulation. The technique was developed by William Kahan in 1965. (Kahan later became one of the key contributors to the early IEEE floating-point standards.) The Kahan summation is most appropriate for a running summation where the accumulator is the larger of the two values. The following listing shows this technique.
Listing 5.18 Kahan summation
GlobalSums/do_kahan_sum.c
1 double do_kahan_sum(double *var, long ncells)
2 {
3 struct esum_type{ ❶
4 double sum; ❶
5 double correction; ❶
6 }; ❶
7
8 double corrected_next_term, new_sum;
9 struct esum_type local;
10
11 local.sum = 0.0;
12 local.correction = 0.0;
13 for (long i = 0; i < ncells; i++) {
14 corrected_next_term = var[i] + local.correction;
15 new_sum = local.sum + corrected_next_term;
16 local.correction = corrected_next_term - ❷
(new_sum - local.sum); ❷
17 local.sum = new_sum;
18 }
19
20 double dsum = local.sum + local.correction; ❸
21 return(dsum); ❸
22 }
❶ Declares a double-double data type
❷ Computes the remainder to carry to the next iteration
❸ Returns the double-precision result
The Kahan summation takes about four floating-point operations instead of one. But the data can be kept in registers or the L1 cache, making the operation less expensive than we might initially expect. Vectorized implementations can make the operation cost the same as the standard summation. This is an example where we use the excess floating-point capability of the processor to get a better answer.
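To see the compensation at work, here is a small check (our own, not from the book’s repository) that runs the Kahan recurrence over one value of 1.0e-1 followed by ten million values of 1.0e-10 and compares it to a plain running sum:

```c
#include <assert.h>
#include <math.h>

/* Plain running sum: one large value first, then n copies of a small value */
double plain_sum(double large, double small, long n)
{
   double sum = large;
   for (long i = 0; i < n; i++) sum += small;
   return sum;
}

/* Kahan compensated sum over the same sequence */
double kahan_sum(double large, double small, long n)
{
   double sum = large;
   double correction = 0.0;
   for (long i = 0; i < n; i++) {
      double corrected_next_term = small + correction;
      double new_sum = sum + corrected_next_term;
      correction = corrected_next_term - (new_sum - sum); /* recovered low bits */
      sum = new_sum;
   }
   return sum + correction;
}
```

The Kahan result matches the reference total 1.0e-1 + 1.0e-3 to within a few ulps, while the plain sum does measurably worse.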
We’ll look at a vector implementation of the Kahan sum in section 6.3.4. Some new numerical methods attempt a similar approach, using the excess floating-point capability of current processors. These methods view the current machine balance of 50 flops per data load as an opportunity: they implement higher-order methods that require more floating-point operations, exploiting the unused floating-point resource because it is essentially free.
The Knuth summation method handles additions where either term can be the larger one. The technique was developed by Donald Knuth in 1969. It collects the error from both terms at a cost of seven floating-point operations, as the following listing shows.
Listing 5.19 Knuth summation
GlobalSums/do_knuth_sum.c
1 double do_knuth_sum(double *var, long ncells)
2 {
3 struct esum_type{ ❶
4 double sum; ❶
5 double correction; ❶
6 }; ❶
7
8 double u, v, upt, up, vpp;
9 struct esum_type local;
10
11 local.sum = 0.0;
12 local.correction = 0.0;
13 for (long i = 0; i < ncells; i++) {
14 u = local.sum;
15 v = var[i] + local.correction;
16 upt = u + v;
17 up = upt - v; ❷
18 vpp = upt - up; ❷
19 local.sum = upt;
20 local.correction = (u - up) + (v - vpp); ❸
21 }
22
23 double dsum = local.sum + local.correction; ❹
24 return(dsum); ❹
25 }
❶ Defines a double-double data type
❷ Carries the values for each term
❸ Combined into one correction
❹ Returns the double-precision result
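To exercise the case where incoming terms can be larger than the running sum, here is a small check (our own, not from the book’s repository) that feeds the Knuth recurrence an alternating stream of large (1.0e-1) and small (1.0e-10) values:

```c
#include <assert.h>
#include <math.h>

/* Knuth (two-term compensated) sum over alternating large/small values */
double knuth_alternating_sum(double large, double small, long npairs)
{
   double sum = 0.0, correction = 0.0;
   for (long i = 0; i < 2 * npairs; i++) {
      double term = (i % 2 == 0) ? large : small;  /* term may dwarf the sum */
      double u = sum;
      double v = term + correction;
      double upt = u + v;
      double up = upt - v;                 /* recover each addend's part */
      double vpp = upt - up;
      sum = upt;
      correction = (u - up) + (v - vpp);   /* error from both terms */
   }
   return sum + correction;
}
```

With 500,000 pairs the result agrees with the analytically computed total to near machine precision.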
The last technique, the quad-precision sum, has the advantage of simplicity in coding, but because quad-precision arithmetic is almost always done in software, it is expensive. Portability is also something to beware of, as not all compilers have implemented a quad-precision type. The following listing presents this code.
Listing 5.20 Quad precision global sum
GlobalSums/do_qdsum.c
1 double do_qdsum(double *var, long ncells)
2 {
3 __float128 qdsum = 0.0; ❶
4 for (long i = 0; i < ncells; i++){
5 qdsum += (__float128)var[i]; ❷
6 }
7 double dsum = qdsum;
8 return(dsum);
9 }
❶ Declares the accumulator as a quad-precision __float128
❷ Casts the input value from array to quad precision
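A quick check of the quad-precision accumulator (our own sketch; __float128 is a GCC extension, so this is compiler-specific) on one value of 1.0e-1 followed by ten million values of 1.0e-10:

```c
#include <assert.h>
#include <math.h>

/* Quad-precision accumulator; basic __float128 arithmetic links without
   extra libraries on GCC, though printing quads would need libquadmath */
double quad_sum(double first, double small, long n)
{
   __float128 qdsum = first;
   for (long i = 0; i < n; i++)
      qdsum += (__float128)small;
   return (double)qdsum;
}
```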
Now on to the assessment of how these different approaches work. Because half the values are 1.0e-1 and the other half are 1.0e-10, we can get an accurate answer to compare against by multiplying instead of adding:
accurate_answer = ncells/2 * 1.0e-1 + ncells/2 * 1.0e-10
Table 5.1 shows the results of comparing the global sum values actually obtained against the accurate answer, along with the measured run times. We essentially get nine digits of accuracy with a regular summation of doubles. The long double on a system with an 80-bit floating-point representation improves on that somewhat but doesn’t completely eliminate the error. The pairwise, Kahan, and Knuth summations all reduce the error to zero with a modest increase in run time. Vectorized implementations of the Kahan and Knuth summations (shown in section 6.3.4) eliminate the increase in run time. Even so, when considering cross-processor communications and the cost of MPI calls, the increase in run time is insignificant.
Table 5.1 Precision and run-time results for various global sum techniques
Now that we understand the behavior of the global sum techniques on a processor, we can consider the problem when the arrays are distributed across multiple processors. We need some understanding of MPI to tackle this problem, so we will show how to do this in section 8.3.3, after learning the basics of MPI.
We have seen some of the characteristics of parallel algorithms including those suitable for extremely parallel architectures. Let’s summarize these so that we can look for them in other situations:
Locality—An often-used term in describing good algorithms, but one without a single fixed definition. It can have multiple meanings. Here are a couple:
Locality for cache—Keeps the values that will be used together close together so that cache utilization is improved.
Locality for operations—Avoids operating on all the data when not all of it is needed. The spatial hash for particle interactions is a classic example that keeps an algorithm’s complexity O(N) instead of O(N²).
Asynchronous—Avoids coordination between threads that can cause synchronization.
Fewer conditionals—Besides the additional performance hit from conditional logic, thread divergence can be a problem on some architectures.
Reproducibility—A highly parallel technique often runs up against the non-associativity of finite-precision arithmetic. Enhanced-precision techniques can help counter this issue.
Higher arithmetic intensity—Current architectures have added floating-point capability faster than memory bandwidth. Algorithms that increase arithmetic intensity, such as vector operations, can make good use of this parallelism.
The development of parallel algorithms is still a young field of research, and there are many new algorithms to be discovered. But there are also many known techniques that have not been widely disseminated or used. Particularly challenging is that the algorithms are often in wildly different fields of computer or computational science.
For more on algorithms, we recommend a popular textbook:
Thomas Cormen, et al., Introduction to Algorithms, 3rd ed (MIT Press, 2009).
For more information on patterns and algorithms, here are two good books for further reading:
Michael McCool, Arch D. Robison, and James Reinders, Structured Parallel Programming: Patterns for Efficient Computation (Morgan Kaufmann, 2012).
Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill, Patterns for Parallel Programming (Addison-Wesley, 2004).
The concepts of spatial hashing have been developed by some of my students ranging from high school level through graduate students. The section on perfect hashing in the following resource draws from work by Rachel Robey and David Nicholaeff. David also implemented spatial hashing in the CLAMR mini-app.
Rachel N. Robey, David Nicholaeff, and Robert W. Robey, “Hash-based algorithms for discretized data,” SIAM Journal on Scientific Computing 35, no. 4 (2013): C346-C368.
The ideas for parallel compact hashing for neighbor finding came from Rebecka Tumblin, Peter Ahrens, and Sara Hartse. These were built from the methods to reduce the writes and reads developed by David Nicholaeff.
Rebecka Tumblin, Peter Ahrens, et al., “Parallel compact hash algorithms for computational meshes,” SIAM Journal on Scientific Computing 37, no. 1 (2015): C31-C53.
Developing optimized methods for the remap operation was much more challenging. Gerald Collom and Colin Redman tackled the problem and came up with some really innovative techniques and implementations on the GPU and in OpenMP. This chapter only touches on some of these. There are far more ideas in their paper:
Gerald Collom, Colin Redman, and Robert W. Robey, “Fast Mesh-to-Mesh Remaps Using Hash Algorithms,” SIAM Journal on Scientific Computing 40, no. 4 (2018): C450-C476.
I first developed the concept of enhanced-precision global sums in about 2010. Jonathan Robey implemented the technique in his Sapient hydrocode and Rob Aulwes, Los Alamos National Laboratory, helped develop the theoretical foundations. The following two references give more details on the method:
Robert W. Robey, Jonathan M. Robey, and Rob Aulwes, “In search of numerical consistency in parallel programming,” Parallel Computing 37, no. 4-5 (2011): 217-229.
Robert W. Robey, “Computational Reproducibility in Production Physics Applications,” Numerical Reproducibility at Exascale Workshop (NRE2015), International Conference for High Performance Computing, Networking, Storage and Analysis, 2015. Available at https://github.com/lanl/ExascaleDocs/blob/master/ ComputationalReproducibilityNRE2015.pdf
A cloud collision model in an ash plume is invoked for particles within a 1 mm distance. Write pseudocode for a spatial hash implementation. What complexity order is this operation?
Big data uses a map-reduce algorithm for efficient processing of large data sets. How does it differ from the hashing concepts presented here?
A wave simulation code uses an AMR mesh to better refine the shoreline. The simulation requirements are to record the wave heights versus time for specified locations where buoys and shore facilities are located. Because the cells are constantly being refined, how could you implement this?
Algorithms and patterns are one of the foundations of computational applications. Selecting algorithms that have low computational complexity and lend themselves to parallelization is important when first developing an application.
A comparison-based sort algorithm has a lower complexity limit of O(N log N). Non-comparison algorithms can break through this lower limit.
Hashing is a non-comparison technique that has been used in spatial hashing to achieve Θ(N) complexity for spatial operations.
For any spatial operation, there is a spatial hashing algorithm that scales as O(N). In this chapter, we provide examples of techniques that can be used in many scenarios.
Certain patterns have been shown to be adaptable to parallelism and the asynchronous nature of GPUs. The prefix scan and hashing techniques are two such patterns. The prefix scan is important for parallelizing irregular-sized arrays. Hashing is a non-comparison, asynchronous algorithm that is highly scalable.
Reproducibility is an important attribute in developing robust production applications. This is especially important for reproducible global sums and for dealing with finite-precision arithmetic operations that are not associative.
Enhanced precision is a new technique that restores associativity, allowing reordering of operations, and thus, more parallelism.
Today, every developer should understand the growing parallelism available within modern CPU processors. Unlocking the untapped performance of CPUs is a critical skill for parallel and high performance computing applications. To show how to take advantage of CPU parallelism, we cover
Using threads for parallel work across multi-core processors
Coordinating work on multiple CPUs and multi-core processors with message passing
The CPU’s parallel capabilities need to be at the core of your parallel strategy. Because it’s the central workhorse, the CPU controls all the memory allocations, memory movement, and communication. The application developer’s knowledge and skill are the most important factors for fully using the CPU’s parallelism. CPU optimization is not automatically done by some magic compiler. Commonly, many of the parallel resources on the CPU go untapped by applications. We can break down the available CPU parallelism into three components in increasing order of effort. These are
Vectorization—Exploits the specialized hardware that can do more than one operation at a time
Multi-core and threading—Spreads out work across the many processing cores in today’s CPUs
Distributed memory—Harnesses multiple nodes into a single, cooperative computing application
Thus, we begin with vectorization. Vectorization is a highly underused capability with notable gains when implemented. Though compilers can do some vectorization, compilers don’t do enough. The limitations are especially noticeable for complicated code. Compilers are just not there yet. Although compilers are improving, there is not sufficient funding or manpower for this to happen quickly. Consequently, the application programmer has to help in a variety of ways. Unfortunately, there is little documentation on vectorization. In chapter 6, we present an introduction to the arcane knowledge of getting more from vectorization for your application.
With the explosion in processing cores on each CPU, the need and knowledge for exploiting on-node parallelism is growing rapidly. Two common CPU resources for this include threading and shared memory. There are dozens of different threading systems and shared memory approaches. In chapter 7, we present a guide to using OpenMP, the most commonly used threading package for high performance computing.
The dominant language for parallelism across nodes, and even within nodes, is the open source standard, the Message Passing Interface (MPI). The MPI standard grew out of a consolidation of many message-passing libraries from the early days of parallel programming. MPI is a well-designed language that has withstood the test of time and changes to hardware architectures. It has also adapted with new features and improvements that have been incorporated into its implementations. Still, most application programmers just use the most basic features of the language. In chapter 8, we give an introduction to the basics of MPI, as well as some advanced features that can be useful in many scientific and big data applications.
The key to getting high performance on the CPU is to pay attention to memory bandwidth, supplying the data to parallel engines. Good parallel performance begins with good serial performance (and an understanding of the topics presented in the first five chapters of this book). CPUs provide the most general parallelism for the widest variety of applications. From modest parallelism through extreme scale, the CPU often delivers the goods. The CPU is also where you must begin your journey into the parallel world. Even in solutions that use accelerators, the CPU remains an essential component in the system.
Up until now, the solution to increasing performance was to add more compute power in the form of physically adding more nodes to your cluster or high performance computer. The parallel and high performance computing community has gone as far as it can with that approach and is beginning to hit power and energy consumption limits. Additionally, the number of nodes and processors cannot continue to grow without running into the limitations of scaling applications. In response to this, we must turn to other avenues to improve performance. Within the processing node, there are a lot of underutilized parallel hardware capabilities. As we first mentioned in section 1.1, parallelism within the node will continue to grow.
Even with continuing limitations of compute power and other looming thresholds, key insights and knowledge of lesser-known tools can unlock substantial performance. Through this book and your studies, we can help tackle these challenges. In the end, your skills and knowledge are important commodities for unlocking the promises of parallel performance.
The examples that accompany the three chapters in part 2 of this book are at https://github.com/EssentialsofParallelComputing, with a separate repository for each chapter. Docker container builds for each chapter should install and work well on any operating system. The container builds for the first two chapters in this part (chapters 6 and 7) use a graphical interface to allow the use of performance and correctness tools.
Processors have special vector units that can load and operate on more than one data element at a time. If we’re limited by floating-point operations, it is absolutely necessary to use vectorization to reach peak hardware capabilities. Vectorization is the process of grouping operations together so more than one can be done at a time. But, adding more flops to hardware capability when an application is memory bound has limited benefit. Take note, most applications are memory bound. Compilers can be powerful, but as you will see, real performance gain with vectorization might not be as easy as the compiler documentation suggests. Still, the performance gain from vectorization can be achieved with a little effort and should not be ignored.
In this chapter, we will show how programmers, with a little bit of effort and knowledge, can achieve a performance boost through vectorization. Some of these techniques simply require the use of the right compiler flags and programming styles, while others require much more work. Real-world examples demonstrate the various ways vectorization is achieved.
Note We encourage you to follow along with the examples for this chapter at https://github.com/EssentialsofParallelComputing/Chapter6.
We introduced the single instruction, multiple data (SIMD) architecture in section 1.4 as one component of Flynn’s Taxonomy. This taxonomy is used as a parallelization classification of instruction and data streams on an architecture. In the SIMD case, as the name indicates, there is a single instruction that is executed across multiple data streams. One vector add instruction replaces eight individual scalar add instructions in the instruction queue, which reduces the pressure on the instruction queue and cache. The biggest benefit is that it takes about the same power to perform eight additions in a vector unit as one scalar addition. Figure 6.1 shows a vector unit that has a 512-bit vector width, offering a vector length of eight double-precision values.
Figure 6.1 A scalar operation does a single double-precision addition in one cycle. It takes eight cycles to process a 64-byte cache line. In comparison, a vector operation on a 512-bit vector unit can process all eight double-precision values in one cycle.
Let’s briefly summarize vectorization terminology:
Vector (SIMD) lane—A pathway through a vector operation on vector registers for a single data element much like a lane on a multi-lane freeway.
Vector width—The width of the vector unit, usually expressed in bits.
Vector length—The number of data elements that can be processed by the vector in one operation.
Vector (SIMD) instruction sets—The set of instructions that extend the regular scalar processor instructions to utilize the vector processor.
Vectorization is produced through both a software and a hardware component. The requirements are
Generate instructions—The vector instructions must be generated by the compiler or manually specified through intrinsics or assembler coding.
Match instructions to the vector unit of the processor—If there is a mismatch between the instructions and the hardware, newer hardware can usually process the instructions, but older hardware will just fail to run. (AVX instructions do not run on ten-year-old chips. Sorry!)
There is no fancy process that converts regular scalar instructions into vector instructions on the fly. If you use an older version of your compiler, as many programmers do, it will not be able to generate instructions for the latest hardware. Unfortunately, it takes time for compiler writers to incorporate new hardware capabilities and instruction sets. It can also take a while for compiler writers to optimize for these capabilities.
Take away: When you use the latest processors, make sure to use the latest versions of the compiler.
You should also specify the appropriate vector instruction set to generate. By default, most compilers take the safe route and generate SSE2 (Streaming SIMD Extensions 2) instructions so that the code works on any hardware. SSE2 instructions execute only two double-precision operations at a time instead of the four or eight that can be done on more recent processors. For performance applications, there are better choices:
You can compile for any architecture manufactured within the last 5 or 10 years. Specifying AVX (Advanced Vector Extensions) instructions would give a 256-bit width vector and would run on any hardware since 2011.
You can ask the compiler to generate more than one vector instruction set. It then falls back to the best one for the hardware being used.
Take away: Specify the most advanced vector instruction set in your compiler flags that you can reasonably use.
To implement the choices just discussed, it helps to know the release history of the hardware and instruction sets when selecting which vector instruction set to use. Table 6.1 highlights the key releases, and figure 6.2 shows the trends in vector unit size.
Figure 6.2 The appearance of vector unit hardware for commodity processors began around 1997 and has slowly grown over the last twenty years, both in vector width (size) and in types of operations supported.
Table 6.1 The vector hardware releases over the last decade have dramatically improved vector functionality.
There are several ways to achieve vectorization in your program. In ascending order of programmer effort, these include
For the least effort to achieve vectorization, programmers should research what libraries are available that they can use for their application. Many low-level libraries provide highly optimized routines for programmers seeking performance. Some of the most commonly used libraries include
BLAS (Basic Linear Algebra Subprograms)—A base component of high-performance linear algebra software
FFT (Fast Fourier transform)—Various implementation packages available
Sparse Solvers—Various implementations of sparse solvers available
The Intel® Math Kernel Library (MKL) implements optimized versions of the BLAS, LAPACK, SCALAPACK, FFTs, sparse solvers, and mathematical functions for Intel processors. Though available as a part of some Intel commercial packages, the library is also offered freely. Many other library developers release packages for a variety of purposes. Additionally, hardware vendors supply optimized libraries for their hardware under different licensing arrangements.
Auto-vectorization is the recommended choice for most programmers because implementation requires the least amount of programming effort. That being said, compilers cannot always recognize where vectorization can be applied safely. In this section, we first look at what kind of code a compiler might automatically vectorize. Then, we show how to verify that you get the actual vectorization you expect. You will also learn about programming styles that make it possible for the compiler to vectorize code and perform other optimizations. This includes the use of the restrict keyword for C and the __restrict or __restrict__ attributes for C++.
With ongoing improvements of architectures and compilers, auto-vectorization can provide significant performance improvement. The proper compiler flags and programming style can improve this further.
Definition Auto-vectorization is the vectorization of the source code by the compiler for standard C, C++, or Fortran languages.
We will discuss compiler flags in more detail in section 6.4, and the timer.c and timer.h files in section 17.2. Compiling the stream_triad.c file with version 8 of the GCC compiler gives the following compiler feedback:
stream_triad.c:19:7: note: loop vectorized
stream_triad.c:12:4: note: loop vectorized
GCC vectorizes both the initialization loop and the stream triad loop! We can run the stream triad with
./stream_triad
We can verify that the compiler uses vector instructions with the likwid tool (section 3.3.1).
likwid-perfctr -C 0 -f -g MEM_DP ./stream_triad
Look in the report output from this command for these lines:
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 |         0 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE      | PMC1 |        98 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 640000000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 |         0 |
In the output, you can see most of the operation counts are in the 256B_PACKED_DOUBLE category on the third line. Why all the 256-bit operations? Some versions of the GCC compiler, including the 8.2 version used in this test, generate 256-bit instead of 512-bit vector instructions for the Skylake processor. Without a tool like likwid, we would need to carefully check the vectorization reports or inspect the generated assembler instructions to discover that the compiler was not generating the proper instructions. For the GCC compiler, we can change the generated instructions by adding the compiler flag -mprefer-vector-width=512 and then try again. Now we get AVX512 instructions with eight double-precision values computed at once:
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 |         0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 | 320000000 |
Let’s look at the output from the GCC compiler for the code in the previous stream triad loop listing:
stream_triad.c:10:4: note: loop vectorized
stream_triad.c:10:4: note: loop versioned for vectorization because of possible aliasing
stream_triad.c:10:4: note: loop vectorized
stream_triad.c:18:4: note: loop vectorized
The compiler cannot tell whether the arguments to the function point to the same or to overlapping data. This causes the compiler to create more than one version of the loop and to produce code that tests the arguments at run time to determine which version to use. We can fix this by adding the restrict attribute to the arguments in the function definition. The C99 standard added this keyword. Unfortunately, C++ has not standardized the restrict keyword, but the __restrict attribute works for GCC, Clang, and Visual C++. Another common form of the attribute in C++ compilers is __restrict__:
void stream_triad(double* restrict a, double* restrict b,
                  double* restrict c, double scalar){
We used GCC to compile the code with the restrict keyword added and got
stream_triad.c:10:4: note: loop vectorized
stream_triad.c:10:4: note: loop vectorized
stream_triad.c:18:4: note: loop vectorized
Now the compiler generates fewer versions of the function. We also need to point out that the -fstrict-aliasing flag tells the compiler to aggressively generate code with the assumption that there is no aliasing.
Definition Aliasing is where pointers point to overlapping regions of memory. In this situation, the compiler cannot tell if it is the same memory, and it would be unsafe to generate vectorized code or other optimizations.
In recent years, the strict aliasing option has become the default with GCC and other compilers (optimization levels -O2 and -O3 set -fstrict-aliasing). This broke a lot of code where aliased variables actually existed. As a result, compilers have dialed back how aggressively they generate more efficient code. All of this is to tell you that you may get different results with various compilers and even different compiler versions.
By using the restrict attribute, you make a promise to the compiler that there is no aliasing. We recommend using both the restrict attribute and the -fstrict-aliasing compiler flag. The attribute travels with the source code and is portable across all architectures and compilers. You’ll need to apply the compiler flags for each compiler, but these affect all of your source.
From these examples, it would seem that the best course of action for programmers to get vectorization is to just let the compiler auto-vectorize. While compilers are improving, for more complex code compilers often fail to recognize that they can safely vectorize the loop. Thus, the programmer needs to help the compiler with hints. We discuss this technique next.
OK, the compiler is not quite able to figure it out and generate vectorized code; is there something we can do? In this section, we will present how to give more precise directions to the compiler. In return, this gives you more control over the vectorization process of your code. Here, you will learn how to use pragmas and directives to convey information to the compiler for portable implementation of vectorization.
Definition A pragma is an instruction to a C or C++ compiler that helps it interpret the source code. It takes the form of a preprocessor statement starting with #pragma. (In Fortran, where it is called a directive, the form is a comment line starting with !$.)
For this example, we added the -fopt-info-vec-missed compiler flag to get a report on the missed loop vectorizations. Compiling this code gives us
main.c:10:4: note: loop vectorized
timestep.c:9:4: missed: couldn't vectorize loop
timestep.c:9:4: missed: not vectorized: control flow in loop.
This vectorization report tells us that the timestep loop was not vectorized due to the conditional in the loop. Let’s see if we can get the loop to optimize by adding a pragma. Add the following line just before the for loop in timestep.c (at line 9):
#pragma omp simd reduction(min:mymindt)
Now compiling the code shows conflicting messages about whether the timestep loop was vectorized:
main.c:10:4: note: loop vectorized
timestep_opt.c:9:9: note: loop vectorized
timestep_opt.c:11:7: note: not vectorized: control flow in loop.
We need to check the executable with a performance tool such as likwid to see if it actually vectorizes:
likwid-perfctr -g MEM_DP -C 0 ./timestep_opt
The output from the likwid tool shows that no vector instructions are being executed:
| DP MFLOP/s      | 451.4928 |
| AVX DP MFLOP/s  |        0 |
| Packed MUOPS/s  |        0 |
With the GCC 9.0 version of the compiler, we have been able to get this to vectorize by adding the -fno-trapping-math flag. If there is a division in a conditional block, this flag tells the compiler not to worry about throwing an error exception, so it will then vectorize. If there is a sqrt in the conditional block, the -fno-math-errno flag will allow the compiler to vectorize. For better portability, the pragma should also tell the compiler that some variables are not preserved across loop iterations and, hence, are not a flow or anti-flow dependency. These dependencies will be discussed after the listing in the following example.
#pragma omp simd private(wavespeed, xspeed, yspeed, dt) reduction(min:mymindt)
An even better way to indicate that the scope of the variables is limited to each iteration of the loop is to declare the variables in the scope of the loop:
double wavespeed = sqrt(g*H[ic]);
double xspeed = (fabs(U[ic])+wavespeed)/dx[ic];
double yspeed = (fabs(V[ic])+wavespeed)/dy[ic];
double dt = sigma/(xspeed+yspeed);
Now we can remove the private clause and the declaration of the variables prior to the loop. We can also add the restrict attribute to the function interface to inform the compiler that the pointers do not overlap:
double timestep(int ncells, double g, double sigma, int* restrict celltype,
                double* restrict H, double* restrict U, double* restrict V,
                double* restrict dx, double* restrict dy);
Even with all of these changes, we were not able to get the GCC compiler to vectorize the code. With further investigation using version 9 of the GCC compiler, we finally were successful by adding the -fno-trapping-math flag. If there is a division in a conditional block, this flag tells the compiler not to worry about throwing an error exception so it will then vectorize. If there is a sqrt in the conditional block, the -fno-math-errno flag allows the compiler to vectorize. The Intel compiler, however, vectorizes all of the versions.
One of the more common operations is a sum of an array. Back in section 4.5, we introduced this type of operation as a reduction. We’ll add a little complexity to the operation by including a conditional that limits the sum to the real cells in a mesh. Here real cells are considered elements not on the boundary or ghost cells from other processors. We discuss ghost cells in chapter 8.
Listing 6.1 Mass sum calculation using a sum reduction loop
mass_sum/mass_sum.c

 1 #include "mass_sum.h"
 2 #define REAL_CELL 1
 3
 4 double mass_sum(int ncells, int* restrict celltype, double* restrict H,
 5                 double* restrict dx, double* restrict dy){
 6    double summer = 0.0;                     ❶
 7 #pragma omp simd reduction(+:summer)        ❷
 8    for (int ic=0; ic<ncells ; ic++) {
 9       if (celltype[ic] == REAL_CELL) {      ❸
10          summer += H[ic]*dx[ic]*dy[ic];
11       }
12    }
13    return(summer);
14 }
❶ Sets the reduction variable to zero
❷ The SIMD loop treats summer as a reduction variable.
❸ The conditional can be implemented with a mask.
The OpenMP SIMD pragma should automatically set the reduction variable to zero, but when the pragma is ignored, the initialization on line 6 is necessary. The OpenMP SIMD pragma on line 7 tells the compiler that we use the summer variable in a reduction sum. In the loop, the conditional on line 9 can be implemented in the vector operations with a mask. Each vector lane has its own copy of summer, and these will then be combined at the end of the for loop.
The Intel compiler successfully recognizes the sum reduction and automatically vectorizes the loop without the OpenMP SIMD pragma. GCC also vectorizes with versions 9 and later of the compiler.
In the previous example, the flow and anti-flow dependencies arise due to the possibility of aliasing between x and xnew. The compiler is being more conservative in this case than it needs to be. The output dependency is only called out in the attempt to vectorize the outer loop. The compiler cannot be certain that the subsequent iterations of the inner loop won’t write to the same location as a prior iteration. Before we continue, let’s define a few terms:
Flow dependency—A variable within the loop is read after being written, known as a read-after-write (RAW).
Anti-flow dependency—A variable within the loop is written after being read, known as a write-after-read (WAR).
Output dependency—A variable is written to more than once in the loop.
For the GCC v8.2 compiler, the vectorization report is
stencil.c:57:10: note: loop vectorized
stencil.c:57:10: note: loop versioned for vectorization because of possible aliasing
stencil.c:51:7: note: loop vectorized
stencil.c:37:7: note: loop vectorized
stencil.c:37:7: note: loop versioned for vectorization because of possible aliasing
The GCC compiler chooses to create two versions of the loop and tests at run time which one to use. The report is nice enough to give us a clear idea of the cause of the problem. There are two ways that we can fix these problems. We can help guide the compiler by adding a pragma before the loop at line 57 like this:
#pragma omp simd
for (int i = 1; i < imax-1; i++){
Another approach to solving this problem is to add a restrict attribute to the definition of x and xnew:
double** restrict x = malloc2D(jmax, imax);
double** restrict xnew = malloc2D(jmax, imax);
The vectorization report for Intel now shows that the inner loop is vectorized with a vectorized peel loop, a main vectorized loop, and a vectorized remainder loop. This calls for a few more definitions.
Peel loop—A loop to execute for misaligned data so that the main loop then has aligned data. Often the peel loop is conditionally executed at run time if the data is discovered to be misaligned.
Remainder loop—A loop that executes after the main loop to handle a partial set of data that is too small for a full vector length.
The peel loop is added to deal with the unaligned data at the start of the loop, and the remainder loop takes care of any extra data at the end of the loop. The reports for all three loops look similar. Looking at the main loop report, we see that the estimated speedup is over six times faster:
LOOP BEGIN at stencil.c(55,21)
   remark #15388: vec support: reference xnew[j][i] has aligned access [ stencil.c(56,13) ]
   remark #15389: vec support: reference x[j][i] has unaligned access [ stencil.c(56,28) ]
   remark #15389: vec support: reference x[j][i-1] has unaligned access [ stencil.c(56,38) ]
   remark #15389: vec support: reference x[j][i+1] has unaligned access [ stencil.c(56,50) ]
   remark #15389: vec support: reference x[j-1][i] has unaligned access [ stencil.c(56,62) ]
   remark #15389: vec support: reference x[j+1][i] has unaligned access [ stencil.c(56,74) ]
   remark #15381: vec support: unaligned access used inside loop body
   remark #15305: vec support: vector length 8
   remark #15399: vec support: unroll factor set to 2
   remark #15309: vec support: normalized vectorization overhead 0.236
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15450: unmasked unaligned unit stride loads: 5
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 43
   remark #15477: vector cost: 6.620
   remark #15478: estimated potential speedup: 6.370
   remark #15486: divides: 1
   remark #15488: --- end vector cost summary ---
   remark #25015: Estimate of max trip count of loop=125
LOOP END
Take note that the estimated speedup is carefully labeled as potential speedup. It is unlikely that you will get the full estimated speedup unless
In the preceding implementation, the actual measured speedup on a Skylake Gold processor with the Intel compiler is 1.39 times faster than the unvectorized version. This vectorization report reflects the speedup of the processor, but we still have a kernel limited by memory bandwidth from main memory to contend with.
For the GCC compiler, the SIMD pragma is successful at eliminating the versioning of the two loops. Adding the restrict attribute, in contrast, had no effect, and both loops are still versioned. Additionally, because there is a vectorized version in all these cases, performance doesn’t change. To understand the speedup, we can compare the performance to a version with vectorization turned off; we find that the speedup for vectorization with GCC is about 1.22 times.
For troublesome loops that just don’t vectorize even with hints, vector intrinsics are another option. In this section, we’ll see how to use intrinsics for more control over vectorization. The downside of vector intrinsics is that they are less portable. Here, we will look at some examples that use vector intrinsics for successful vectorization, showing how intrinsics can vectorize the Kahan sum introduced in section 5.7. In that section, we said the cost of the Kahan sum was about four floating-point operations instead of one for a normal sum operation. But if we can vectorize the Kahan sum operation, the cost becomes much less.
The implementations in these examples use a 256-bit vector intrinsic to speed up the operation by nearly a factor of four over the serial version. We show three different ways to implement a Kahan sum kernel in the listings for the following examples. You will find the full implementation at https://github.com/lanl/GlobalSums.git, which is extracted from the global sums example. It is included in the GlobalSumsVectorized directory in the code for this chapter.
We then tested the Kahan sum implemented with the three vector intrinsics against the original serial sum and original Kahan sum. We used version 8.2 of the GCC compiler and ran the tests on a Skylake Gold processor. The GCC compiler fails to vectorize the serial sum and the original Kahan sum code. Adding an OpenMP pragma gets the serial sum to vectorize, but the loop-carried dependency in the Kahan sum prevents the compiler from vectorizing the code.
It is important to note in the following performance results that the vectorized versions for serial and Kahan sums with all three vector intrinsics (bolded) have nearly identical run times. We can do more floating-point operations in the same time and simultaneously reduce the numerical error. This is a great example that with some effort, floating-point operations can come for free.
SETTINGS INFO -- ncells 1073741824 log 30
Initializing mesh with Leblanc problem, high values first

  relative diff    runtime    Description
      8.423e-09   1.273343    Serial sum
              0   3.519778    Kahan sum with double double accumulator

4 wide vectors serial sum
     -3.356e-09   0.683407    Intel vector intrinsics Serial sum
     -3.356e-09   0.682952    GCC vector intrinsics Serial sum
     -3.356e-09   0.682756    Fog C++ vector class Serial sum

4 wide vectors Kahan sum
              0   1.030471    Intel Vector intrinsics Kahan sum
              0   1.031490    GCC vector extensions Kahan sum
              0   1.032354    Fog C++ vector class Kahan sum

8 wide vector serial sum
     -1.986e-09   0.663277    Serial sum (OpenMP SIMD pragma)
     -1.986e-09   0.664413    8 wide Intel vector intrinsic Serial sum
     -1.986e-09   0.664067    8 wide GCC vector intrinsic Serial sum
     -1.986e-09   0.663911    8 wide Fog C++ vector class Serial sum

8 wide vector Kahan sum
     -1.388e-16   0.689495    8 wide Intel Vector intrinsics Kahan sum
     -1.388e-16   0.689100    8 wide GCC vector extensions Kahan sum
     -1.388e-16   0.689472    8 wide Fog C++ vector class Kahan sum
In this section, we will cover when it is appropriate to write vector assembly in your application. We’ll also discuss what vector assembler code looks like, how to disassemble your compiled code, and how to tell which vector instruction set the compiler generated.
Programming vector units directly with vector assembly instructions has the greatest opportunity to achieve maximum performance. But it takes a deep understanding of the performance behavior of the large number of vector instructions across many different processors. Programmers without this expertise will probably get better performance from using vector intrinsics as shown in the previous section than from directly writing vector assembler instructions. In addition, the portability of vector assembly code is limited; it will only work on a small set of processor architectures. For these reasons, it is rare that writing vector assembly instructions makes sense.
Because it seldom makes sense to do more than simply look at the assembler instructions that the compiler generates, we won’t go through a programming example of writing a routine in assembler from scratch.
We suggest that you adopt a programming style that is more compatible with the needs of vectorization and other forms of loop parallelization. Based on the lessons learned from the examples throughout the chapter, certain programming styles can help the compiler generate vectorized code. Adopting the following programming styles leads to better performance out of the box and less work needed for optimization efforts.
Use the restrict attribute on pointers in function arguments and declarations (C and C++).
Use pragmas or directives where needed to inform the compiler.
Be careful with optimizing for the compiler with #pragma unroll and other techniques; you might limit the possible options for the compiler transformations.²
Put exceptions and error checks with print statements in a separate loop.
Try to use a data structure with a long length for the innermost loop.
Use contiguous memory accesses. Some newer instruction sets implement gather/scatter memory loads, but these are less efficient.
Use Structure of Arrays (SOA) rather than Array of Structures (AOS).
Make loop bounds a local variable by copying global values and then using them.
Expose the loop bound size so it is known to the compiler. If the loop is only three iterations long, the compiler might unroll the loop rather than generate a four-wide vector instruction.
Define local variables within a loop so that it is clear that these are not carried to subsequent iterations (C and C++).
Variables and arrays within a loop should be write-only or read-only (only on the left side of the equal sign or on the right side, except for reductions).
Don’t reuse local variables for a different purpose in the loop—create a new variable. The memory space you waste is far less important than the confusion this creates for the compiler.
Avoid function calls and inline instead (manually or with the compiler).
Limit conditionals within the loop and, where necessary, use simple forms that can be masked.
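Several of these style points can be combined in one small kernel. The sketch below is our own illustration (the function and variable names are not from the book's examples); it uses restrict pointers, a local copy of the loop bound, loop-local temporaries, and read-only/write-only arrays:

```c
// Vectorization-friendly style: restrict pointers, a local loop bound,
// read-only input, write-only output, and loop-local temporaries.
void scale_shift(int n, const double *restrict x,
                 double *restrict y, double a, double b)
{
   int nsize = n;              // local copy of the loop bound
   for (int i = 0; i < nsize; i++) {
      double t = a * x[i];    // declared in the loop: no carried dependency
      y[i] = t + b;           // y is write-only; x is read-only
   }
}
```

With these hints, most compilers can vectorize a loop like this without any pragmas at all.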
Concerning compiler settings and flags:
Use the latest version of a compiler and prefer compilers that do better vectorization.
Generate code for the most powerful vector instruction set you can get away with.
Tables 6.2 and 6.3 show the compiler flags that are recommended for vectorization for the latest version of various compilers. Compiler flags for vectorization frequently change, so check the documentation for the compiler version that you are using.
The strict aliasing flag listed in column two of table 6.2 should help with auto-vectorization for C and C++, but verify that it doesn't break any code. Column three in table 6.2 has the various options for specifying which vectorization instruction set to use for some of the compilers. The ones shown in the table should be a good starting point. Vectorization reports can be generated with the compiler flags in column two of table 6.2. The compiler reports are still improving for most of the compilers and are likely to change. For GCC, the optimized and missed flags are recommended. Getting the loop optimization reports at the same time as the vectorization reports can be helpful so that you can see whether loops have been unrolled or interchanged. If you are using the OpenMP SIMD directives without the rest of OpenMP, use the flags in the last column of table 6.3.
Table 6.2 Vectorization flags for various compilers
GCC: -ftree-vectorize -march=native -mtune=native
Cray: -h vector3, -h preferred_vector_width=#
Table 6.3 OpenMP SIMD and vectorization report flags for various compilers
You can set the vector instructions to any single set, such as AVX2, or to multiple sets. We'll show you how to do both. For a single instruction set, the flags shown in the previous tables request that the compiler use the vector instruction set of the processor used for compiling (-march=native, -xHost, and -qarch=pwr9). Without this flag, the compiler uses the SSE2 set. If you are interested in running across a wide range of processors, you may want to specify an older instruction set or just use the default. There is some loss in performance with the older sets.
Support for more than one vector instruction set can be added with the Intel compiler. This is common practice for the Intel Knights Landing processor, where the instruction set for the host processor might be different. For this, you must specify both instruction sets:
-axmic-avx512 -xcore-avx2
The -ax flag adds the additional set. Note that the host keyword cannot be used when requesting two instruction sets.
We briefly mentioned the use of a floating-point flag to encourage vectorization in the sum reduction kernel when discussing listing 6.1. When vectorizing loops with a conditional, the compiler inserts a mask that uses only part of the vector results. But the masked operations can generate a floating-point error by dividing by zero or taking the square root of a negative number. GCC and Clang compilers require that the extra floating-point flags shown in the last column of table 6.2 be set to vectorize loops with conditionals and any potentially problematic floating-point operations.
There are some situations where you might want to turn off vectorization. Turning off vectorization lets you see the improvement and speedup you achieved with vectorization, and lets you check that you get the same answer with and without it. Sometimes auto-vectorization gives you the wrong answer and, thus, you need to turn it off. You may also want to vectorize only the computationally intensive files and skip the rest.
Table 6.4 Compiler flags to turn off vectorization
There is no compiler flag to turn off vectorization (it is on by default)
Cray: -h vector0 and -hfp0 or -hfp1 (default vectorization level is -h vector2)
The vectorization and performance results in this chapter were obtained with GCC v8 and v9 and with the Intel compiler v19. As noted in table 6.1, 512-bit vector support was added to GCC starting in version 8 and to the Intel compiler in version 18. So the capability for the new 512-bit vector hardware is recent.
A CMake module for setting compiler flags
Setting all of the flags for compiler vectorization is messy and difficult to get right. So we have created a CMake module that you can use, which is similar to the FindOpenMP.cmake and FindMPI.cmake modules. Then the main CMakeLists.txt file just needs
find_package(Vector)
if (CMAKE_VECTOR_VERBOSE)
    set(VECTOR_C_FLAGS "${VECTOR_C_FLAGS} ${VECTOR_C_VERBOSE}")
endif()
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${VECTOR_C_FLAGS}")
The CMake module is shown in FindVector.cmake in the main directory for this chapter’s examples at https://github.com/EssentialsofParallelComputing/Chapter6.git. Also see the GlobalSumsVectorized code example for using the FindVector.cmake module. We’ll migrate the module to other examples to help clean up our CMakeLists.txt file as well. The following listing is an excerpt from the module for the C compiler. The flags for C++ and Fortran are also set with similar code in the FindVector.cmake module.
Listing 6.2 Excerpt from FindVector.cmake for C compiler
FindVector.cmake
# Main output flags
#    VECTOR_<LANG>_FLAGS            ❶
#    VECTOR_NOVEC_<LANG>_FLAGS      ❷
#    VECTOR_<LANG>_VERBOSE          ❸
# Component flags
#    VECTOR_ALIASING_<LANG>_FLAGS   ❹
#    VECTOR_ARCH_<LANG>_FLAGS       ❺
#    VECTOR_FPMODEL_<LANG>_FLAGS    ❻
#    VECTOR_NOVEC_<LANG>_OPT        ❼
#    VECTOR_VEC_<LANG>_OPTS         ❽
...
if(CMAKE_C_COMPILER_LOADED)
   if ("${CMAKE_C_COMPILER_ID}" STREQUAL "Clang") # using Clang
      set(VECTOR_ALIASING_C_FLAGS "${VECTOR_ALIASING_C_FLAGS} -fstrict-aliasing")
      if ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "x86_64")
         set(VECTOR_ARCH_C_FLAGS "${VECTOR_ARCH_C_FLAGS} -march=native -mtune=native")
      elseif ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "ppc64le")
         set(VECTOR_ARCH_C_FLAGS "${VECTOR_ARCH_C_FLAGS} -mcpu=powerpc64le")
      elseif ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "aarch64")
         set(VECTOR_ARCH_C_FLAGS "${VECTOR_ARCH_C_FLAGS} -march=native -mtune=native")
      endif ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "x86_64")

      set(VECTOR_OPENMP_SIMD_C_FLAGS "${VECTOR_OPENMP_SIMD_C_FLAGS} -fopenmp-simd")
      set(VECTOR_C_OPTS "${VECTOR_C_OPTS} -fvectorize")
      set(VECTOR_C_FPOPTS "${VECTOR_C_FPOPTS} -fno-math-errno")
      set(VECTOR_NOVEC_C_OPT "${VECTOR_NOVEC_C_OPT} -fno-vectorize")
      set(VECTOR_C_VERBOSE "${VECTOR_C_VERBOSE} -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize")

   elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "GNU") # using GCC
      set(VECTOR_ALIASING_C_FLAGS "${VECTOR_ALIASING_C_FLAGS} -fstrict-aliasing")
      if ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "x86_64")
         set(VECTOR_ARCH_C_FLAGS "${VECTOR_ARCH_C_FLAGS} -march=native -mtune=native")
      elseif ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "ppc64le")
         set(VECTOR_ARCH_C_FLAGS "${VECTOR_ARCH_C_FLAGS} -mcpu=powerpc64le")
      elseif ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "aarch64")
         set(VECTOR_ARCH_C_FLAGS "${VECTOR_ARCH_C_FLAGS} -march=native -mtune=native")
      endif ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "x86_64")

      set(VECTOR_OPENMP_SIMD_C_FLAGS "${VECTOR_OPENMP_SIMD_C_FLAGS} -fopenmp-simd")
      set(VECTOR_C_OPTS "${VECTOR_C_OPTS} -ftree-vectorize")
      set(VECTOR_C_FPOPTS "${VECTOR_C_FPOPTS} -fno-trapping-math -fno-math-errno")
      if ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "x86_64")
         if ("${CMAKE_C_COMPILER_VERSION}" VERSION_GREATER "7.9.0")
            set(VECTOR_C_OPTS "${VECTOR_C_OPTS} -mprefer-vector-width=512")
         endif ("${CMAKE_C_COMPILER_VERSION}" VERSION_GREATER "7.9.0")
      endif ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "x86_64")

      set(VECTOR_NOVEC_C_OPT "${VECTOR_NOVEC_C_OPT} -fno-tree-vectorize")
      set(VECTOR_C_VERBOSE "${VECTOR_C_VERBOSE} -fopt-info-vec-optimized -fopt-info-vec-missed -fopt-info-loop-optimized -fopt-info-loop-missed")

   elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "Intel") # using Intel C
      set(VECTOR_ALIASING_C_FLAGS "${VECTOR_ALIASING_C_FLAGS} -ansi-alias")
      set(VECTOR_FPMODEL_C_FLAGS "${VECTOR_FPMODEL_C_FLAGS} -fp-model:precise")

      set(VECTOR_OPENMP_SIMD_C_FLAGS "${VECTOR_OPENMP_SIMD_C_FLAGS} -qopenmp-simd")
      set(VECTOR_C_OPTS "${VECTOR_C_OPTS} -xHOST")
      if ("${CMAKE_C_COMPILER_VERSION}" VERSION_GREATER "17.0.4")
         set(VECTOR_C_OPTS "${VECTOR_C_OPTS} -qopt-zmm-usage=high")
      endif ("${CMAKE_C_COMPILER_VERSION}" VERSION_GREATER "17.0.4")
      set(VECTOR_NOVEC_C_OPT "${VECTOR_NOVEC_C_OPT} -no-vec")
      set(VECTOR_C_VERBOSE "${VECTOR_C_VERBOSE} -qopt-report=5 -qopt-report-phase=openmp,loop,vec")

   elseif (CMAKE_C_COMPILER_ID MATCHES "PGI")
      set(VECTOR_ALIASING_C_FLAGS "${VECTOR_ALIASING_C_FLAGS} -alias=ansi")
      set(VECTOR_OPENMP_SIMD_C_FLAGS "${VECTOR_OPENMP_SIMD_C_FLAGS} -Mvect=simd")

      set(VECTOR_NOVEC_C_OPT "${VECTOR_NOVEC_C_OPT} -Mnovect")
      set(VECTOR_C_VERBOSE "${VECTOR_C_VERBOSE} -Minfo=loop,inline,vect")

   elseif (CMAKE_C_COMPILER_ID MATCHES "MSVC")
      set(VECTOR_C_OPTS "${VECTOR_C_OPTS}" " ")

      set(VECTOR_NOVEC_C_OPT "${VECTOR_NOVEC_C_OPT}" " ")
      set(VECTOR_C_VERBOSE "${VECTOR_C_VERBOSE} -Qvec-report:2")

   elseif (CMAKE_C_COMPILER_ID MATCHES "XL")
      set(VECTOR_ALIASING_C_FLAGS "${VECTOR_ALIASING_C_FLAGS} -qalias=restrict")
      set(VECTOR_FPMODEL_C_FLAGS "${VECTOR_FPMODEL_C_FLAGS} -qstrict")
      set(VECTOR_ARCH_C_FLAGS "${VECTOR_ARCH_C_FLAGS} -qhot -qarch=auto -qtune=auto")

      set(CMAKE_VEC_C_FLAGS "${CMAKE_VEC_FLAGS} -qsimd=auto")
      set(VECTOR_NOVEC_C_OPT "${VECTOR_NOVEC_C_OPT} -qsimd=noauto")
      # "long vector" optimizations
      #set(VECTOR_NOVEC_C_OPT "${VECTOR_NOVEC_C_OPT} -qhot=novector")
      set(VECTOR_C_VERBOSE "${VECTOR_C_VERBOSE} -qreport")

   elseif (CMAKE_C_COMPILER_ID MATCHES "Cray")
      set(VECTOR_ALIASING_C_FLAGS "${VECTOR_ALIASING_C_FLAGS} -h restrict=a")
      set(VECTOR_C_OPTS "${VECTOR_C_OPTS} -h vector=3")

      set(VECTOR_NOVEC_C_OPT "${VECTOR_NOVEC_C_OPT} -h vector=0")
      set(VECTOR_C_VERBOSE "${VECTOR_C_VERBOSE} -h msgs -h negmsgs -h list=a")

   endif()

   set(VECTOR_BASE_C_FLAGS "${VECTOR_ALIASING_C_FLAGS} ${VECTOR_ARCH_C_FLAGS} ${VECTOR_FPMODEL_C_FLAGS}")
   set(VECTOR_NOVEC_C_FLAGS "${VECTOR_BASE_C_FLAGS} ${VECTOR_NOVEC_C_OPT}")
   set(VECTOR_C_FLAGS "${VECTOR_BASE_C_FLAGS} ${VECTOR_C_OPTS} ${VECTOR_C_FPOPTS} ${VECTOR_OPENMP_SIMD_C_FLAGS}")
endif()
❶ Sets all flags and turns on vectorization
❷ Sets all flags but disables vectorization
❸ Turns on verbose messages when compiling for vectorization feedback
❹ Stricter aliasing option to help auto-vectorization
❺ Set to compile for the architecture it is running on
❻ Set so that Kahan sum does not get optimized out (unsafe optimizations)
❼ Turns off vectorization for debugging and performance measurement
With the release of the OpenMP 4.0 standard, we have the option of using a more portable set of SIMD directives. These directives are implemented as commands rather than hints. We have already seen the use of these directives in section 6.3.3. The directives can be used to request only vectorization, or they can be combined with the for/do directive to request both threading and vectorization. The general syntax for C and C++ pragmas is
#pragma omp simd        // Vectorizes the following loop or block of code
#pragma omp for simd    // Threads and vectorizes the following loop
The general syntax for Fortran directives is
!$omp simd      ! Vectorizes the following loop or block of code
!$omp do simd   ! Threads and vectorizes the following loop
The basic SIMD directive can be supplemented with additional clauses to communicate more information. The most common additional clause is some variant of the private clause. This clause breaks false dependencies by creating a separate, private variable for each vector lane. An example of the syntax is
#pragma omp simd private(x)
for (int i=0; i<n; i++){
   x = array[i];
   y = sqrt(x)*x;
}
For a simple private clause, the recommended approach for C and C++ programmers is to just define the variable in the loop to make clear your intent:
double x = array[i];
The firstprivate clause initializes the private variable for each thread with the value coming into the loop, while the lastprivate clause sets the variable after the loop to the logically last value it would have in a sequential form of the loop.
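A short sketch of our own illustrating lastprivate: without the clause, x would be shared across the vector lanes and its value after the loop would be unreliable in a vectorized execution.

```c
// After the loop, x holds the value from the logically last iteration,
// just as it would in the sequential version of the loop.
double last_value(int n, const double *restrict arr)
{
   double x = 0.0;
   #pragma omp simd lastprivate(x)
   for (int i = 0; i < n; i++) {
      x = 2.0 * arr[i];
   }
   return x;   // value from iteration n-1
}
```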
The reduction clause creates a private variable for each lane and then performs the specified operation between the values for each lane at the end of the loop. The reduction variable for each vector lane is initialized to the value appropriate for the specified operation.
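For example, a vectorized sum reduction (a minimal sketch of ours) looks like this; for the + operation, each lane's partial sum starts at zero:

```c
double sum_reduce(int n, const double *restrict x)
{
   double sum = 0.0;
   // Each vector lane accumulates its own partial sum; the lane values
   // are combined into sum when the loop ends.
   #pragma omp simd reduction(+:sum)
   for (int i = 0; i < n; i++) {
      sum += x[i];
   }
   return sum;
}
```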
The aligned clause tells the compiler that the data is aligned on a 64-byte boundary so that peel loops do not need to be generated. Aligned data can be loaded into vector registers more efficiently. But first, the memory needs to be allocated with memory alignment. There are many different functions that you can use to get aligned memory, but there are still issues with portability. Here are some of the possibilities:
void *memalign(size_t alignment, size_t size);
int posix_memalign(void **memptr, size_t alignment, size_t size);
void *aligned_alloc(size_t alignment, size_t size);
void *aligned_malloc(size_t alignment, size_t size);
You can also use attributes to a memory definition to specify memory alignment:
double x[100] __attribute__((aligned(64)));
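Putting the two pieces together, here is a sketch of ours that uses posix_memalign (POSIX; aligned_alloc is the C11 equivalent) for the allocation and the aligned clause on the loop. The helper names make_aligned_array and fill_squares are our own, not from the chapter's examples:

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

// Allocate n doubles on a 64-byte boundary.
double *make_aligned_array(size_t n)
{
   void *ptr = NULL;
   if (posix_memalign(&ptr, 64, n * sizeof(double)) != 0) return NULL;
   return (double *)ptr;
}

// The aligned clause promises 64-byte alignment, so the compiler
// does not need to generate a peel loop before the vector body.
void fill_squares(size_t n, double *x)
{
   #pragma omp simd aligned(x:64)
   for (size_t i = 0; i < n; i++) x[i] = (double)(i * i);
}
```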
Another important modifier is the collapse clause. It tells the compiler to combine nested loops into a single loop for the vectorized implementation. The argument to the clause indicates how many loops to collapse:
#pragma omp simd collapse(2)
for (int j=0; j<n; j++){
   for (int i=0; i<n; i++){
      x[j][i] = 0.0;
   }
}
The loops are required to be perfectly nested. Perfectly nested loops only have statements in the innermost loop, with no extraneous statements before or after each loop block. The following clauses are for more specialized cases:
The linear clause says that the variable changes for every iteration by some linear function.
The safelen clause tells the compiler that the dependencies are separated by the specified length, which allows the compiler to vectorize for vector lengths shorter than or equal to the safe length clause argument.
The simdlen clause generates vectorization of the specified length instead of the default length.
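A sketch of ours showing safelen: x[i] depends on x[i-8], so any vector length up to eight lanes is safe, and safelen(8) tells the compiler exactly that.

```c
// The loop-carried dependency has a distance of 8, so vector lengths
// of up to 8 lanes produce the same result as the sequential loop.
void shift_add(int n, double *x)
{
   #pragma omp simd safelen(8)
   for (int i = 8; i < n; i++) {
      x[i] = x[i-8] + 1.0;
   }
}
```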
We can also vectorize an entire function or subroutine so that it can be called from within a vectorized region of the code. The syntax is a little different for C/C++ and Fortran. For C/C++, we’ll use an example where the radial distance of an array of points is calculated using the Pythagorean theorem:
#pragma omp declare simd
double pythagorean(double a, double b){
   return(sqrt(a*a + b*b));
}
For Fortran, the subroutine or function name must be specified as an argument to the SIMD clause:
subroutine pythagorean(a, b, c)
!$omp declare simd(pythagorean)
   real*8 a, b, c
   c = sqrt(a**2+b**2)
end subroutine pythagorean
The OpenMP SIMD function directive can also take some of the same clauses and some new ones as follows:
The inbranch or notinbranch clause informs the compiler whether the function is called from within a conditional or not.
The uniform clause says that the argument specified in the clause stays constant for all calls and does not need to be set up as a vector in the vectorized call.
The linear(ref, val, uval) clause specifies to the compiler that the variable in the clause argument is linear in some form. For example, Fortran passes arguments by reference, and successive calls pass successive array locations. In the previous Fortran example, the clause would look like this:
!$omp declare simd(pythagorean) linear(ref(a, b, c))
The clause can also be used to specify that the value is linear and whether the step is a larger constant as might occur in a strided access.
You won’t find a lot of available materials on vectorization. The best approach for further explorations is to try vectorizing a smaller code block and experiment with the compilers that you commonly use. That being said, Intel has a lot of brief vectorization guides that are the best and most current resources. Look on the Intel website for the latest materials.
John Levesque, Cray Corporation, has authored a recent book with a good chapter on vectorization:
John Levesque and Aaron Vose, Programming for Hybrid Multi/Manycore MPP Systems, (CRC Press, 2017).
Agner Fog has some of the best references on vectorization in his optimization guides, for example:
Agner Fog, “Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms,” 2004-2018 (last updated Aug, 2018).
Agner Fog, “VCL C++ vector class library,” v. 1.30 (2012-2017) available as a PDF at https://www.agner.org/optimize/vectorclass.pdf.
Experiment with auto-vectorizing loops from the multimaterial code in section 4.3 (https://github.com/LANL/MultiMatTest.git). Add the vectorization and loop report flags and see what your compiler tells you.
Add OpenMP SIMD pragmas to the loop you selected in the first exercise to help the compiler vectorize it.
For one of the vector-intrinsic examples, change the vector length from four double-precision values to an eight-wide vector width. Check the source code for this chapter for examples of working code for eight-wide implementations.
If you are on an older CPU, does your program from exercise 3 successfully run? What is the performance impact?
Both auto- and manual vectorization can provide significant performance improvements for your code. To underscore this:
We show several different methods for vectorizing code with different levels of control, effort, and performance.
We provide a list of programming styles to achieve vectorization.
1. It is important to note that although auto-vectorization often yields significant performance gains, it can sometimes slow down the code. This happens when the overhead of setting up the vector instructions is greater than the performance gain. The compiler generally decides whether to vectorize using a cost function: it vectorizes if the cost function shows that the code would be faster, but it is guessing at the array lengths and assumes all the data comes from the first level of cache.
As many-core architectures grow in size and popularity, the details of thread-level parallelism become a critical factor in software performance. In this chapter, we first introduce the basics of Open Multi-Processing (OpenMP), a shared memory programming standard, and why it’s important to have a fundamental understanding of how OpenMP functions. We will look at sample problems ranging in difficulty from a simple common “Hello World” example to a complex split-direction stencil implementation with OpenMP parallelization. We will thoroughly analyze the interaction between OpenMP directives and the underlying OS kernel, as well as the memory hierarchy and hardware features. Finally, we will investigate a promising high-level approach to OpenMP programming for future extreme-scale applications. We show that high-level OpenMP is efficient for algorithms containing many short loops of computational work.
When compared to more standard-threading approaches, the high-level OpenMP paradigm leads to a reduction in thread overhead costs, synchronization waits, cache thrashing, and memory usage. Given these advantages, it is essential that the modern parallel computing programmer (you) knows both shared and distributed memory programming paradigms. We discuss the distributed memory programming paradigm in chapter 8 on the Message Passing Interface (MPI).
Note You’ll find the accompanying source code for this chapter at https://github.com/EssentialsofParallelComputing/Chapter7.
OpenMP is one of the most widely supported open standards for threads and shared-memory parallel programming. In this section, we will explain the standard, ease of use, expected gains, difficulties, and the memory models.
The version of OpenMP that you see today took some time to develop and is still evolving. The origin of OpenMP began when several hardware vendors introduced their implementations in the early 1990s. A failed attempt was made in 1994 to standardize these implementations in the ANSI X3H5 draft standard. It was not until the introduction of wide-scale, multi-core systems in the late ’90s that a re-emergence of the OpenMP approach was spurred, leading to the first OpenMP standard in 1997.
Today, OpenMP provides a standard and portable API for writing shared-memory parallel programs using threads; it’s known to be easy to use, allowing for fast implementation, and requires only a small increase in code, normally seen in the context of pragmas or directives. A pragma (C/C++) or directive (Fortran) indicates to the compiler where to initiate OpenMP threads. These terms, pragma and directive, are often used interchangeably. Pragmas are preprocessor statements in C and C++. Directives are written as comments in Fortran in order for the program to retain the standard language syntax when OpenMP is not used. Although using OpenMP requires a compiler that supports it, most compilers come standard with that support.
OpenMP makes parallelization achievable for a beginner, thus allowing for an easy and fun introduction to scaling an application beyond one core. With the easy use of OpenMP pragmas and directives, a block of code can be quickly executed in parallel. In figure 7.1, you can see a conceptual view of the effort required and the performance obtained for OpenMP and MPI (discussed in chapter 8). Using OpenMP will often be the first exciting step into scaling an application.
Although it is easy to achieve modest parallelism with OpenMP, thorough optimization can be a challenge. The source of the difficulty is the relaxed memory model that permits thread race conditions to exist. By relaxed, we mean that the values of variables in main memory are not updated immediately. It would be too expensive to do so for every change to a variable. Because of the delay in the updates, minor timing differences between memory operations by each thread on shared variables have the potential to cause different results from run to run. Let’s look at some definitions:
Relaxed memory model—The values of variables in main memory or in the caches of all the processors are not updated immediately.
Race condition—A situation where multiple outcomes are possible, and the result is dependent on the timing of the contributors.
Figure 7.1 Conceptual visualization of the programming effort required to improve performance using either MPI or OpenMP
OpenMP was initially used to parallelize highly regular loops using threads on shared memory multiprocessors. Within a threaded parallel construct, each variable can be either shared or private. The terms shared and private have a particular meaning for OpenMP. Here are their definitions:
Private variable—In the context of OpenMP, a private variable is local and only visible to its thread.
Shared variable—In the context of OpenMP, a shared variable is visible and modifiable by any thread.
Truly understanding these terms requires a fundamental view of how memory is managed for a threaded application. As figure 7.2 shows, each thread has a private memory in its stack and shares memory in the heap.
Figure 7.2 The threaded memory model helps with understanding which variables are shared and which are private. Each thread, shown by the squiggly lines, has its own instruction pointer, stack pointer, and stack memory but shares the heap and static memory data.
OpenMP directives specify work sharing but say nothing about the memory or data location. As a programmer, you must understand the implicit rules for the memory scope of variables. The OS kernel can use several techniques to manage memory for OpenMP and threading. The most common technique is the first touch concept, where memory is allocated nearest to the thread where it is first touched. We define work sharing and first touch as
Work sharing—To split the work across a number of threads or processes.
First touch—The first touch of an array causes the memory to be allocated. The memory is allocated near the thread location where the touch occurs. Prior to the first touch, the memory only exists as an entry in a virtual memory table. The physical memory that corresponds to the virtual memory is created when it is first accessed.
The reason that first touch is important is that on many high-end, high-performance computing nodes, there are multiple memory regions. When there are multiple memory regions, there is often Non-Uniform Memory Access (NUMA) from a CPU and its processes to different portions of memory, adding an important consideration for optimizing code performance.
Definition On some computing nodes, blocks of memory are closer to some processors than others. This situation is called Non-Uniform Memory Access (NUMA). This is often the case when a node has two CPU sockets with each socket having its own memory. A processor’s access to memory in the other NUMA domain typically takes twice the time (penalty) as it does to access its own memory.
Moreover, because OpenMP has a relaxed memory model, an OpenMP barrier or flush operation is required for the memory view of a thread to be communicated to other threads. A flush operation guarantees that a value moves between two threads, preventing race conditions. An OpenMP barrier flushes all the locally modified values and synchronizes the threads. How this updating of the values is done is a complicated operation in the hardware and operating system.
On a shared-memory, multi-core system, the modified values in cache must be flushed to the main memory and updated. Newer CPUs use specialized hardware to determine what actually changed, so the cache in dozens of cores only updates if necessary. But it is still an expensive operation and forces threads to stall while waiting for updates. In many ways, it is a similar kind of operation to what you need to do when you want to remove a thumb drive from your computer; you have to tell the operating system to flush all the thumb drive caches and then wait. Codes that use frequent barriers and flushes combined with smaller parallel regions often have excessive synchronization leading to poor performance.
OpenMP addresses a single node, not multiple nodes with distributed memory architectures. Thus, its memory scalability is limited to the memory on the node. For parallel applications that have larger memory requirements, OpenMP needs to be used in conjunction with a distributed-memory parallel technique. We discuss the most common of these, the MPI standard, in chapter 8.
Table 7.1 shows some common OpenMP concepts, terminology, and directives. We will demonstrate the use of these in the rest of the chapter.
Table 7.1 Roadmap of OpenMP topics in this chapter
#pragma omp parallel for — Splits work equally between threads. Scheduling clauses include static, dynamic, guided, and auto. Directives can also be combined for specific calls within routines.

#pragma omp parallel for reduction(+:sum), (min:xmin), or (max:xmax)

#pragma omp barrier — With multiple threads running, this call creates a stopping point so that all the threads can regroup before moving to the next section.

#pragma omp masked — Executes on thread zero with no barrier at the end.
#pragma omp single — Executes on one thread with an implicit barrier at the end of the block.
These last two directives prevent multiple threads from executing the code. Use them when you have a function within a parallel region that you only want to run on one thread.
Now we’ll show you how to apply each of the OpenMP concepts and directives. In this section, you will learn how to create a region of code with multiple threads using the OpenMP parallel pragma on a traditional “Hello World” problem distributed among threads. You will see how easy it is to use OpenMP and, potentially, to achieve performance gains. There are several ways to control how many threads you have in the parallel region. These are
Default—The default is usually the maximum number of threads for the node, but it can be different, depending on the compiler and if MPI ranks exist.
Environment variable—Set the size with the OMP_NUM_THREADS environment variable; for example
export OMP_NUM_THREADS=16
Function call—Set the number of threads from within the program; for example
omp_set_num_threads(16)
The simple example in listings 7.1 through 7.6 shows how to get your thread ID and the number of threads. Listing 7.1 shows our first attempt at writing a Hello World program.
Listing 7.1 A simple hello OpenMP program that prints Hello OpenMP
HelloOpenMP/HelloOpenMP.c

#include <stdio.h>
#include <omp.h>                                               ❶

int main(int argc, char *argv[]){
   int nthreads, thread_id;
   nthreads = omp_get_num_threads();                           ❷
   thread_id = omp_get_thread_num();                           ❷
   printf("Goodbye slow serial world and Hello OpenMP!\n");
   printf(" I have %d thread(s) and my thread id is %d\n",nthreads,thread_id);
}
❶ Includes OpenMP header file for the OpenMP function calls (mandatory)
❷ Function calls to get the number of threads and the thread ID
gcc -fopenmp -o HelloOpenMP HelloOpenMP.c
where -fopenmp is the compiler flag to turn on OpenMP.
Next, we’ll set the number of threads for the program to use by setting an environment variable. We could also use the function call omp_set_num_threads() or just let OpenMP pick the number of threads based on the hardware that we are running on. To set the number of threads, use this command to set the environment variable:
export OMP_NUM_THREADS=4
Now, running the executable with ./HelloOpenMP, we get
Goodbye slow serial world and Hello OpenMP!
 I have 1 thread(s) and my thread id is 0
Not quite what we wanted; there is only one thread. We have to add a parallel region to get multiple threads. Listing 7.2 shows how to add the parallel region.
Note In listings throughout the chapter, you’ll see the annotations >> Spawn threads >> and Implied Barrier. These are visual cues to show where threads are spawned and where barriers are inserted by the compiler. In later listings, we’ll use the same style of annotation for Explicit Barrier, where we have inserted a barrier directive.
Listing 7.2 Adding a parallel region to Hello OpenMP
HelloOpenMP/HelloOpenMP_fix1.c

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]){
   int nthreads, thread_id;
   #pragma omp parallel                        >> Spawn threads >>
   {
      nthreads = omp_get_num_threads();
      thread_id = omp_get_thread_num();
      printf("Goodbye slow serial world and Hello OpenMP!\n");
      printf(" I have %d thread(s) and my thread id is %d\n",nthreads,thread_id);
   }                                           Implied Barrier
}
With these changes, we get the following output:
Goodbye slow serial world and Hello OpenMP!
 I have 4 thread(s) and my thread id is 3
Goodbye slow serial world and Hello OpenMP!
 I have 4 thread(s) and my thread id is 3
Goodbye slow serial world and Hello OpenMP!
Goodbye slow serial world and Hello OpenMP!
 I have 4 thread(s) and my thread id is 3
 I have 4 thread(s) and my thread id is 3
As you can see, all of the threads report that they are thread number 3. This is because nthreads and thread_id are shared variables. The value that is assigned at run time to these variables is the one written by the last thread to execute the instruction. This is a typical race condition as figure 7.3 illustrates. It is a common issue in threaded programs of any type.
Figure 7.3 Variables in the previous example are defined before the parallel region, thus these are shared variables in the heap. Each thread writes to these, and the final value is determined by which one writes last. The shading represents progression through time with writes at different clock cycles by various threads in a non-deterministic fashion. This situation and similar situations are called race conditions because the results can vary from run to run.
Also note that the order of the printout is random, depending on the order of the writes from each processor and how they get flushed to the standard output device. To get the right thread numbers, we define the thread_id variable inside the parallel region so that the scope of the variable becomes private to the thread, as the following listing shows.
Listing 7.3 Defining variables where they are used in Hello OpenMP
HelloOpenMP/HelloOpenMP_fix2.c

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]){
   #pragma omp parallel                        >> Spawn threads >>
   {
      int nthreads = omp_get_num_threads();    ❶
      int thread_id = omp_get_thread_num();    ❶
      printf("Goodbye slow serial world and Hello OpenMP!\n");
      printf(" I have %d thread(s) and my thread id is %d\n",nthreads,thread_id);
   }                                           Implied Barrier
}
❶ Definition of nthreads and thread_id moved into the parallel region.
Goodbye slow serial world and Hello OpenMP!
Goodbye slow serial world and Hello OpenMP!
 I have 4 thread(s) and my thread id is 2     ❶
Goodbye slow serial world and Hello OpenMP!
 I have 4 thread(s) and my thread id is 3     ❶
Goodbye slow serial world and Hello OpenMP!
 I have 4 thread(s) and my thread id is 0     ❶
 I have 4 thread(s) and my thread id is 1     ❶
❶ Now we get a different thread ID for each thread.
Say we really didn’t want every thread printing out. Let’s minimize the output and put the print statement in a single OpenMP clause as the following listing shows, so only one thread writes output.
Listing 7.4 Adding a single pragma to print output for Hello OpenMP
HelloOpenMP/HelloOpenMP_fix3.c

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]){
   #pragma omp parallel                        >> Spawn threads >>
   {
      int nthreads = omp_get_num_threads();    ❶
      int thread_id = omp_get_thread_num();    ❶
      #pragma omp single                       ❷
      {                                        ❷
         printf("Number of threads is %d\n",nthreads);   ❷
         printf("My thread id %d\n",thread_id);          ❷
      }                                        Implied Barrier ❷
   }                                           Implied Barrier
}
❶ Variables defined in a parallel region are private.
❷ Places output statements into an OpenMP single pragma block
Number of threads is 4
My thread id 2
The thread ID is a different value on each run. Here, we really wanted the thread that prints out to be the first thread, so we change the OpenMP clause in the next listing to use masked instead of single.
Listing 7.5 Changing a single pragma to a masked pragma in Hello OpenMP
HelloOpenMP/HelloOpenMP_fix4.c

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]){
   #pragma omp parallel                        >> Spawn threads >>
   {
      int nthreads = omp_get_num_threads();
      int thread_id = omp_get_thread_num();
      #pragma omp masked                       ❶
      {
         printf("Goodbye slow serial world and Hello OpenMP!\n");
         printf(" I have %d thread(s) and my thread id is %d\n",nthreads,thread_id);
      }
   }                                           Implied Barrier
}
❶ Adds directive to run only on main thread
Running this code now returns what we were first trying to do:
Goodbye slow serial world and Hello OpenMP!
 I have 4 thread(s) and my thread id is 0
We can make this operation even more concise and use fewer pragmas as we show in listing 7.6. The first print statement does not need to be in the parallel region. Also, we can limit the second printout to thread zero by simply using a conditional on the thread number. The implied barrier is from the omp parallel pragma.
Listing 7.6 Reducing the number of pragmas in Hello OpenMP
HelloOpenMP/HelloOpenMP_fix5.c

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]){
   printf("Goodbye slow serial world and Hello OpenMP!\n");   ❶
   #pragma omp parallel                        >> Spawn threads >>   ❷
   if (omp_get_thread_num() == 0) {            ❸
      printf(" I have %d thread(s) and my thread id is %d\n",
             omp_get_num_threads(), omp_get_thread_num());
   }
                                               Implied Barrier
}
❶ Moves print statement out of parallel region
❷ Pragma applies to next statement or a scoping block delimited by curly braces.
❸ Replaces OpenMP masked pragma with conditional for thread zero
We have learned a few important things from this example:
Variables that are defined outside a parallel region are by default shared in the parallel region.
We should always strive to use the smallest program scope for a variable that still gives correct behavior. By defining the variable inside the parallel region, the compiler can better understand our intent and handle it correctly.
Using the masked clause is more restrictive than the single clause because it requires thread 0 to execute the code block. The masked clause also does not have an implicit barrier at the end.
We need to watch out for possible race conditions between the operations of different threads.
OpenMP is continuously updating and releasing new versions. Before using an OpenMP implementation, you should know the version and the features that are supported. OpenMP started with the ability to harness threads across a single node. New capabilities, such as vectorization, and offloading tasks to accelerators, such as GPUs, have been added to the OpenMP standard. The following table shows some of the major features added in the last decade.
One thing to note is that to deal with the substantial changes in hardware that are occurring, the pace of changes to OpenMP has increased since 2011. While the changes in version 3.0 and 3.1 dealt mostly with the standard CPU threading model, since then the changes in versions 4.0, 4.5, and 5.0 have mostly dealt with other forms of hardware parallelism, such as accelerators and vectorization.
OpenMP has three specific use-case scenarios to meet the needs of three different types of users. The first decision you need to make is which scenario is appropriate for your situation. The strategy and techniques vary for each of these cases: loop-level OpenMP, high-level OpenMP, and OpenMP to enhance MPI implementations. In the following sections, we will elaborate on each of these, when to use them and why, and how to use them. Figure 7.4 shows the recommended material to be carefully read for each of the use cases.
Figure 7.4 The recommended reading for each of the scenarios depends on the use case for your application.
A standard use case for loop-level OpenMP is when your application only needs a modest speedup and has plenty of memory resources. By this we mean that its requirements can be satisfied by the memory on a single hardware node. In this use case, it might be sufficient to use loop-level OpenMP. The following list summarizes the application characteristics of loop-level OpenMP:
We use loop-level OpenMP in these cases because it takes little effort and can be done quickly. With separate parallel for pragmas, the issue of thread race conditions is reduced. By placing OpenMP parallel for pragmas or parallel do directives before key loops, the parallelism of the loop can be easily achieved. Even when the end goal is a more efficient implementation, this loop-level approach is often the first step when introducing thread parallelism to an application.
Note If your use case requires only modest speedup, go to section 7.3 for examples of this approach.
Next we discuss a different scenario, high-level OpenMP, where higher performance is desired. Our high-level OpenMP design has a radical difference from the strategies for standard loop-level OpenMP. Standard OpenMP starts from the bottom-up and applies the parallelism constructs at the loop level. Our high-level OpenMP approach takes a whole system view to the design with a top-down approach that addresses the memory system, the system kernel, and the hardware. The OpenMP language does not change, but the method of its use does. The end result is that we eliminate many of the thread startup costs and the costs of synchronization that hobble the scalability of loop-level OpenMP.
If you need to extract every last bit of performance out of your application, then high-level OpenMP is for you. Begin by learning loop-level OpenMP in section 7.3 as a starting point for your application. Then you will need to gain a deeper understanding of OpenMP variable scope from sections 7.4 and 7.5. Finally, dive into section 7.6 for a look at how the diametrically opposite approach of high-level OpenMP from the loop-level approach results in better performance. In that section, we’ll look at the implementation model and a step-by-step method to reach the desired structure. This is followed by detailed examples of implementations for high-level OpenMP.
We can also use OpenMP to supplement distributed memory parallelism (as discussed in chapter 8). The basic idea of using OpenMP on a small subset of processes adds another level of parallel implementation that helps for extreme scaling. This could be within the node, or better yet, the set of processors that uniformly share quick access to shared memory, commonly referred to as a Non-Uniform Memory Access (NUMA) region.
We first discussed NUMA regions in OpenMP concepts in section 7.1.1 as an additional consideration for performance optimization. By using threading only within one memory region where all memory accesses have the same cost, some of the complexity and performance traps of OpenMP are avoided. In a more modest hybrid implementation, OpenMP can be used to harness the two-to-four hyperthreads for each processor. We’ll discuss this scenario, the hybrid MPI + OpenMP, in chapter 8 after describing the basics of MPI.
For the OpenMP skills needed for this hybrid approach with small thread counts, it is sufficient to learn the loop-level OpenMP techniques in section 7.3. Then move incrementally to a more efficient and scalable OpenMP implementation, which allows more and more threads to replace MPI ranks. This requires at least some of the steps on the path to high-level OpenMP as presented in section 7.6. Now that you know what sections are important for your application’s use case, let’s jump into the details of how to make each strategy work.
In this section, we will look at examples of loop-level parallelization. The loop-level use case was introduced in section 7.2.1; here we will show you the implementation details. Let’s begin.
Parallel regions are initiated by inserting pragmas around blocks of code that can be divided among independent threads (for example, do loops and for loops). OpenMP relies on the OS kernel for its memory handling, and this reliance is often an important factor that keeps OpenMP from reaching its peak potential. We'll look at why this happens. Each variable within a parallel construct can be either shared or private. Moreover, OpenMP has a relaxed memory model: each thread keeps a temporary view of memory so that it does not have to store to main memory with every operation. When the temporary view must eventually be reconciled with main memory, an OpenMP barrier or flush operation is required to synchronize memory. Each of these synchronizations comes with a cost, both for the time it takes to perform the flush and because fast threads must wait for slower ones to complete. Understanding how OpenMP works can help you reduce these performance bottlenecks.
Performance is not the only concern for an OpenMP programmer. You should also watch for correctness issues caused by thread race conditions. Threads may progress at different speeds on the processors and, in combination with the relaxed memory synchronization, serious errors can suddenly appear even in well-tested code. Careful programming and the use of specialized tools, as discussed in section 7.9.2, are essential for robust OpenMP applications.
In this section, we’ll take a look at a few loop-level OpenMP examples to get an idea of how it is used in practice. The source code that accompanies the chapter has more variants of each example. We strongly encourage you to experiment with each of these on the architecture and compiler that you commonly work with. We ran each of the examples on a Skylake Gold 6152 dual socket system, as well as a 2017 Mac laptop. Threads are allocated by cores, and thread binding is enabled using the following OpenMP environment variables to reduce the performance variation of runs:
export OMP_PLACES=cores
export OMP_CPU_BIND=true
We’ll explore thread placement and binding more in chapter 14. For now, to help you get experience with loop-level OpenMP, we’ll present three different examples: vector addition, stream triad, and a stencil code. We’ll show the parallel speedup of the three examples after the last example in section 7.3.4.
In the vector addition example (listing 7.7), you can see the interaction between the three components: OpenMP work-sharing directives, implied variable scope, and memory placement by the operating system. These three components are necessary for OpenMP program correctness and performance.
Listing 7.7 Vector add with a simple loop-level OpenMP pragma
VecAdd/vecadd_opt1.c
 1 #include <stdio.h>
 2 #include <time.h>
 3 #include <omp.h>
 4 #include "timer.h"
 5
 6 #define ARRAY_SIZE 80000000                                       ❶
 7 static double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE];
 8
 9 void vector_add(double *c, double *a, double *b, int n);
10
11 int main(int argc, char *argv[]){
12    #pragma omp parallel
13       if (omp_get_thread_num() == 0)
14          printf("Running with %d thread(s)\n",omp_get_num_threads());
15
16    struct timespec tstart;
17    double time_sum = 0.0;
18    for (int i=0; i<ARRAY_SIZE; i++) {                             ❷
19       a[i] = 1.0;
20       b[i] = 2.0;
21    }
22
23    cpu_timer_start(&tstart);
24    vector_add(c, a, b, ARRAY_SIZE);
25    time_sum += cpu_timer_stop(tstart);
26
27    printf("Runtime is %lf msecs\n", time_sum);
28 }
29
30 void vector_add(double *c, double *a, double *b, int n)
31 {
32    #pragma omp parallel for                                       ❸
33    for (int i=0; i < n; i++){                                     ❹
34       c[i] = a[i] + b[i];
35    }
36 }
❶ Array is large enough to force into main memory.
❷ Initializes the a and b arrays
❸ Single-combined OpenMP parallel for pragma
❹ Vector add loop distributed across threads
This particular implementation style produces modest parallel performance on a single node. Take note, this implementation could be better. All the array memory is first touched by the main thread during the initialization prior to the main loop as shown on the left in figure 7.5. This can cause the memory to be located in a different memory region, where the memory access time is greater for some of the threads.
Figure 7.5 Adding a single OpenMP pragma on the main vector add computation loop (on the left) results in the a and b arrays being touched first by the main thread; the data is allocated near thread zero. The c array is first touched during the computation loop and, therefore, the memory for the c array is close to each thread. On the right, adding an OpenMP pragma on the initialization loop results in the memory for the a and b arrays being placed near the thread where the work is done.
Now, to improve the OpenMP performance, we insert pragmas in the initialization loops as listing 7.8 shows. The loops are distributed in the same static threading partition, so the threads that touch the memory in the initialization loop will have the memory located near to them by the operating system (shown on the right side of figure 7.5).
Listing 7.8 Vector add with first touch
VecAdd/vecadd_opt2.c
11 int main(int argc, char *argv[]){
12    #pragma omp parallel
13       if (omp_get_thread_num() == 0)
14          printf("Running with %d thread(s)\n",omp_get_num_threads());
15
16    struct timespec tstart;
17    double time_sum = 0.0;
18    #pragma omp parallel for                                       ❶
19    for (int i=0; i<ARRAY_SIZE; i++) {                             ❷
20       a[i] = 1.0;
21       b[i] = 2.0;
22    }
23
24    cpu_timer_start(&tstart);
25    vector_add(c, a, b, ARRAY_SIZE);
26    time_sum += cpu_timer_stop(tstart);
27
28    printf("Runtime is %lf msecs\n", time_sum);
29 }
30
31 void vector_add(double *c, double *a, double *b, int n)
32 {
33    #pragma omp parallel for                                       ❸
34    for (int i=0; i < n; i++){                                     ❹
35       c[i] = a[i] + b[i];
36    }
37 }
❶ Initialization in a “parallel for” pragma so first touch gets memory in the proper location
❷ Initializes the a and b arrays
❸ OpenMP for pragma to distribute work for vector add loop across threads
❹ Vector add loop executed by the threads
The threads in the second NUMA region no longer have a slower memory access time. This improves the memory bandwidth for the threads in the second NUMA region and also improves the load balance across the threads. First touch is an OS policy that was mentioned earlier in section 7.1.1. Good first touch implementations may often gain a 10 to 20% performance improvement. For evidence that this is the case, see table 7.2 in section 7.3.4 for the performance improvement on these examples.
If NUMA is enabled in the BIOS, the Skylake Gold 6152 CPU has a factor of about two decrease in performance when accessing remote memory. As with most tunable parameters, the configuration of individual systems can vary. To see your configuration, you can use the numactl and numastat commands for Linux. You may have to install the numactl-libs or numactl-devel packages for these commands.
Figure 7.6 shows the output for the Skylake Gold test platform. The node distances listed at the end of the output roughly capture the cost of accessing memory on a remote node. You can think of this as the relative number of hops to get to memory. Here the memory access cost is a little over a factor of two (21 versus 10). Note that sometimes two NUMA region systems are listed with a cost of 20 versus 10 as a default configuration instead of their real costs.
Figure 7.6 Output from the numactl and numastat commands. The distance between memory regions is highlighted. Note that the NUMA utilities use the term “node” differently than we have defined it. In their terminology, each NUMA region is a node. We reserve the node terminology for a separate distributed memory system such as another desktop or tray in a rack-mounted system.
The NUMA configuration information can tell you what is important to optimize. If you only have one NUMA region, or the difference in memory access costs is small, you may not need to worry as much about first touch optimizations. If the system is configured for interleaved memory accesses to the NUMA regions, optimizing for the faster local memory accesses will not help. In the absence of specific information or when trying to optimize in general for larger HPC systems, you should use first touch optimizations to get local, faster memory accesses.
The following listing shows another similar example for the stream triad benchmark. This example runs multiple iterations of the kernel to get an average performance.
Listing 7.9 Loop-level OpenMP threading of the stream triad
StreamTriad/stream_triad_opt2.c
 1 #include <stdio.h>
 2 #include <time.h>
 3 #include <omp.h>
 4 #include "timer.h"
 5
 6 #define NTIMES 16
 7 #define STREAM_ARRAY_SIZE 80000000                                ❶
 8 static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE],
      c[STREAM_ARRAY_SIZE];
 9
10 int main(int argc, char *argv[]){
11    #pragma omp parallel
12       if (omp_get_thread_num() == 0)
13          printf("Running with %d thread(s)\n",omp_get_num_threads());
14
15    struct timeval tstart;
16    double scalar = 3.0, time_sum = 0.0;
17    #pragma omp parallel for
18    for (int i=0; i<STREAM_ARRAY_SIZE; i++) {                      ❷
19       a[i] = 1.0;
20       b[i] = 2.0;
21    }
22
23    for (int k=0; k<NTIMES; k++){
24       cpu_timer_start(&tstart);
25       #pragma omp parallel for
26       for (int i=0; i<STREAM_ARRAY_SIZE; i++){                    ❸
27          c[i] = a[i] + scalar*b[i];
28       }
29       time_sum += cpu_timer_stop(tstart);
30       c[1]=c[2];                                                  ❹
31    }
32
33    printf("Average runtime is %lf msecs\n", time_sum/NTIMES);
34 }
❶ Large enough to force into main memory
❷ Initializes the a and b arrays for proper first touch
❸ Stream triad computational kernel
❹ Keeps the compiler from optimizing the loop
Again, we just need one pragma to implement the OpenMP threaded computation at line 25. A second pragma inserted at line 17 further improves performance because of the better memory placement obtained by a proper first touch technique.
The third example of loop-level OpenMP is the stencil operation first introduced in chapter 1 (figure 1.10). This stencil operator adds the surrounding neighbors and takes an average for the new value of the cell. Listing 7.10 has more complex memory read access patterns and, as we optimize the routine, it shows us the effect of threads accessing memory written by other threads. In this first loop-level OpenMP implementation, each parallel for block is synchronized by default, which prevents potential race conditions. In later, more optimized versions of the stencil, we’ll add explicit synchronization directives.
Listing 7.10 Loop-level OpenMP threading in the stencil example with first touch
Stencil/stencil_opt2.c
 1 #include <stdio.h>
 2 #include <stdlib.h>
 3 #include <time.h>
 4 #include <omp.h>
 5
 6 #include "malloc2D.h"
 7 #include "timer.h"
 8
 9 #define SWAP_PTR(xnew,xold,xtmp) (xtmp=xnew, xnew=xold, xold=xtmp)
10
11 int main(int argc, char *argv[])
12 {
13    #pragma omp parallel
14    #pragma omp masked
15       printf("Running with %d thread(s)\n",omp_get_num_threads());
16
17    struct timeval tstart_init, tstart_flush, tstart_stencil, tstart_total;
18    double init_time, flush_time, stencil_time, total_time;
19    int imax=2002, jmax = 2002;
20    double** xtmp;
21    double** x = malloc2D(jmax, imax);
22    double** xnew = malloc2D(jmax, imax);
23    int *flush = (int *)malloc(jmax*imax*sizeof(int)*4);
24
25    cpu_timer_start(&tstart_total);
26    cpu_timer_start(&tstart_init);
27    #pragma omp parallel for                                       ❶
28    for (int j = 0; j < jmax; j++){
29       for (int i = 0; i < imax; i++){
30          xnew[j][i] = 0.0;
31          x[j][i] = 5.0;
32       }
33    }
34
35    #pragma omp parallel for                                       ❶
36    for (int j = jmax/2 - 5; j < jmax/2 + 5; j++){
37       for (int i = imax/2 - 5; i < imax/2 -1; i++){
38          x[j][i] = 400.0;
39       }
40    }
41    init_time += cpu_timer_stop(tstart_init);
42
43    for (int iter = 0; iter < 10000; iter++){
44       cpu_timer_start(&tstart_flush);
45       #pragma omp parallel for                                    ❷
46       for (int l = 1; l < jmax*imax*4; l++){
47          flush[l] = 1.0;
48       }
49       flush_time += cpu_timer_stop(tstart_flush);
50       cpu_timer_start(&tstart_stencil);
51       #pragma omp parallel for                                    ❷
52       for (int j = 1; j < jmax-1; j++){
53          for (int i = 1; i < imax-1; i++){
54             xnew[j][i]=(x[j][i] + x[j][i-1] + x[j][i+1] +
                           x[j-1][i] + x[j+1][i])/5.0;
55          }
56       }
57       stencil_time += cpu_timer_stop(tstart_stencil);
58
59       SWAP_PTR(xnew, x, xtmp);
60       if (iter%1000 == 0) printf("Iter %d\n",iter);
61    }
62    total_time += cpu_timer_stop(tstart_total);
63
64    printf("Timing: init %f flush %f stencil %f total %f\n",
65           init_time,flush_time,stencil_time,total_time);
66
67    free(x);
68    free(xnew);
69    free(flush);
70 }
❶ Initializes with OpenMP pragma for first-touch memory allocation
❷ Inserts parallel for pragma to thread loop
For this example, we inserted a flush loop at line 46 to empty the cache of the x and xnew arrays. This mimics the performance of a code that does not have the variables in cache from a prior operation. The case without data in cache is termed a cold cache, and when the data is in cache it is called a warm cache. Both cold and warm caches are valid cases to analyze for different use-case scenarios. Simply put, both cases are possible in a real application, and without a deep analysis it may be difficult to know which one will occur.
Let’s review the performance of the earlier examples in this section. As seen in listings 7.8, 7.9, and 7.10, introducing loop-level OpenMP requires few changes to the source code. As table 7.2 demonstrates, the performance improvement is on the order of 10x faster. This is a pretty good performance return for the effort required. But for a system with 88 threads, the achieved parallel efficiency is modest, at about 19% as calculated below, giving us some room for improvement. To calculate the speedup, we first take the serial run time divided by the parallel run time like this:
Stencil speedup = (serial run-time)/(parallel run-time) = 17.0 times faster
If we get perfect speedup on 88 threads, it would be 88. We take the actual speedup and divide by the ideal speedup of 88 to calculate the parallel efficiency:
Stencil parallel efficiency = (stencil speedup)/(ideal speedup) = 17 / 88 = 19%
Parallel efficiency is much better at smaller thread counts; at four threads, it is 85%. The effect of getting memory allocated close to the thread is small, but significant. In the timings in table 7.2, the first optimization, simple loop-level OpenMP, has OpenMP parallel for pragmas only on the computation loops. The second optimization, with first touch, adds OpenMP parallel for pragmas on the initialization loops. Table 7.2 summarizes the performance improvements for simple OpenMP with the addition of a first touch optimization. The timings used OMP_PLACES=cores and OMP_CPU_BIND=true.
Table 7.2 Run times in msecs. The speedup on a Skylake Gold 6152 dual-socket node with the GCC version 8.2 compiler is a factor of ten on 88 threads. Adding an OpenMP pragma on the initialization to get proper first-touch memory allocation returns an additional speedup.
Profiling the stencil application threaded with OpenMP, we observe that 10-15% of the run time is consumed by OpenMP overhead, consisting of thread waits and thread startup costs. We can reduce the OpenMP overhead by adopting a high-level OpenMP design as we’ll discuss in section 7.6.
Another common type of loop is a reduction. Reductions are a common pattern in parallel programming that were introduced in section 5.7. Reductions are any operation that starts with an array and calculates a scalar result. In OpenMP, this can also be handled easily in the loop-level pragma with the addition of a reduction clause as the following listing shows.
Listing 7.11 Global sum with OpenMP threading
GlobalSums/serial_sum_novec.c
 1 double do_sum_novec(double* restrict var, long ncells)
 2 {
 3    double sum = 0.0;                              ❶
 4    #pragma omp parallel for reduction(+:sum)      ❷
 5    for (long i = 0; i < ncells; i++){
 6       sum += var[i];
 7    }
 8
 9    return(sum);
10 }
❶ The sum variable must be initialized to zero
❷ OpenMP parallel for loop with reduction clause
The reduction operation computes a local sum on each thread and then sums all the threads together. The reduction variable, sum, is initialized to the appropriate value for the operation. In the code in listing 7.11, the reduction variable is initialized to zero. The initialization of the sum variable to zero on line 3 is still needed for proper operation when we don’t use OpenMP.
Loop-level OpenMP can be applied to most, but not all, loops. The loop must have a canonical form so that the OpenMP compiler can apply the work-sharing operation. The canonical form is the traditional, straightforward loop implementation that programmers learn first. The requirements are that the loop index be an integer, the loop-control expressions not change within the loop body, there be no early exits out of the loop (such as break or goto), and the loop iterations be independent, with no loop-carried dependencies.
You can test the last requirement by reversing the order of the loop or by changing the order of the loop operations. If the answer changes, the loop has loop-carried dependencies. There are similar restrictions on loop-carried dependencies for vectorization on the CPU and threading implementations on the GPU. The similarities of this loop-carried dependency requirement have been described as fine-grained parallelization versus the coarse-grained structure used in a distributed-memory, message-passing approach. Here are some definitions:
Fine-grained parallelization—A type of parallelism where computational loops or other small blocks of code are operated on by multiple processors or threads and may need frequent synchronization.
Coarse-grained parallelization—A type of parallelism where the processor operates on large blocks of code with infrequent synchronization.
Many programming languages have proposed a modified loop type that tells the compiler that loop-level parallelism may be applied in some form. For now, supplying a pragma or directive before the loop conveys this information.
To convert an application or routine to high-level OpenMP, you need to understand variable scope. The OpenMP specifications are vague on many scoping details. Figure 7.7 shows the scoping rules for compilers. Generally, a variable on the stack is considered private, and those that are placed in the heap are shared (figure 7.2). For high-level OpenMP, the most important case is how to manage scope in a called routine in a parallel region.
Figure 7.7 Summary of thread scoping rules for OpenMP applications
When determining the scope of variables, you should put more focus on variables on the left-hand side of an expression. The scope for variables that are being written to is more important to get correct. Note that private variables are undefined at entry and after the exit of a parallel region as listing 7.12 shows. The firstprivate and lastprivate clauses can modify this behavior in special cases. If a variable is private, we should see it set before it is used in the parallel block and not used after the parallel region. If a variable is intended to be private, it is best to declare the variable within the loop because a locally declared variable has exactly the same behavior as a private OpenMP variable. Long story short, declaring the variable within the loop eliminates any confusion on what the behavior should be. It does not exist before the loop or afterward, so incorrect uses are not possible.
Listing 7.12 Private variable entering the OpenMP parallel block
1 double x;                              ❶
2 #pragma omp parallel for private(x)    ❷
3 for (int i=0; i < n; i++){
4    x = 1.0;                            ❸
5    double y = x*2.0;                   ❹
6 }
7
8 double z = x;                          ❺
❶ Variable declared outside parallel for block.
❷ Private clause on parallel for block
❸ X will not be defined, so it must be set first.
❹ Declared private variable within loop is better style.
❺ X is undefined after the exit of the parallel region.
On the directive on line 4 of listing 7.11, we added a reduction clause to note the special treatment needed for the sum variable. On line 2 of listing 7.12, we showed the private clause. Other clauses can be used on the parallel directive and other program blocks, for example, firstprivate, lastprivate, shared, and default.
We highly recommend using tools such as Intel® Inspector and Allinea/ARM MAP to develop more efficient code and to implement high-level OpenMP. We discuss some of these tools in section 7.9. Becoming familiar with a variety of essential tools is necessary before beginning the implementation of high-level OpenMP. After running your application through these tools, a better understanding of the application allows for a smoother transition to the implementation of high-level OpenMP.
We will introduce the concept of high-level OpenMP in section 7.6. But before we attempt high-level OpenMP, it is necessary to see how the loop-level implementations can be expanded to cover larger sections of code. The purpose for expanding the loop-level implementation is to lower the overhead and increase parallel efficiency. When expanding the parallel region, it eventually covers an entire subroutine. Once we convert the whole function into an OpenMP parallel region, OpenMP provides far less control over the thread scope of the variables. The clauses for a parallel region no longer help as there is no place to add scoping clauses. So how do we control variable scope?
While the defaults for variable scope in functions usually work well, there are cases where they don’t. The only OpenMP pragma control for functions is the threadprivate directive that makes a declared variable private. Most variables in a function are on the stack and are already private. If there is an array dynamically allocated in the routine, the pointer it is assigned to is a local variable on the stack, which means it is private and different for every thread. We want this array to be shared, but there is no directive for that. Using the specific compiler scoping rules from figure 7.7, we add a save attribute to the pointer declaration in Fortran, making the compiler put the variable in the heap and, thus, sharing the variable among the threads. In C, the variable can be declared static or made file scope. The following listing shows some examples of the thread scope of variables for Fortran, and listing 7.14 shows examples for C and C++.
Listing 7.13 Function-level variable scope in Fortran
 4 subroutine function_level_OpenMP(n, y)
 5    integer :: n
 6    real :: y(n)                          ❶
 7
 8    real, allocatable :: x(:)             ❷
 9    real x1                               ❸
10    real :: x2 = 0.0                      ❹
11    real, save :: x3                      ❺
12    real, save, allocatable :: z          ❻
13
14    if (thread_id .eq. 0) allocate(x(100))   ❼
15
16    ! lots of code
17
18    if (thread_id .eq. 0) deallocate(x)
19 end subroutine function_level_OpenMP
❶ Pointer for array y and its array elements that are private
❷ Pointer for allocatable array x that is private
❸ Variable x1 is on the stack, so it is private.
❹ Variable x2 is shared in Fortran 90.
❺ Variable x3 is placed on the heap, so it is shared.
❻ Pointer for z array is on the heap and is shared.
❼ The x array memory is shared, but the pointer to x is private.
The pointer for array y on line 6 is the scope of the variable at the location of the subroutine. In this case, it is in a parallel region, making it private. Both the pointer for x and the variable x1 are private. The scope of variable x2 on line 10 is more complicated. It is shared in Fortran 90 and private in Fortran 77. Initialized variables in Fortran 90 are on the heap and are only initialized (to zero in this case) on their first occurrence! The variables x3 and z on lines 11 and 12 are shared because these are in the heap. The memory allocated for x on line 14 is on the heap and shared, but the pointer is private, which results in memory only accessible on thread zero.
Listing 7.14 Function-level variable scope in C/C++
 5 void function_level_OpenMP(int n, double *y)   ❶
 6 {
 7    double *x;                                   ❷
 8    static double *x1;                           ❸
 9
10    int thread_id;
11 #pragma omp parallel
12    thread_id = omp_get_thread_num();
13
14    if (thread_id == 0) x = (double *)malloc(100*sizeof(double));    ❹
15    if (thread_id == 0) x1 = (double *)malloc(100*sizeof(double));   ❺
16
17    // lots of code
18    if (thread_id == 0) free(x);
19    if (thread_id == 0) free(x1);
20 }
❶ The pointer to array y is private.
❷ The pointer to array x is private.
❸ The pointer to array x1 is shared.
❹ Memory for the x array is shared.
❺ Memory for the x1 array is shared.
The pointer to array y in the argument list on line 5 is on the stack. It has the scope of the variable at the calling location. In a parallel region, the pointer to y is private. The memory for the x array is on the heap and shared, but the pointer is private, so the memory is only accessible from thread zero. Memory for the x1 array is on the heap and shared, and the pointer is shared so the memory is accessible and shared across all the threads.
You always need to be on guard for unexpected effects of variable declarations and definitions that impact the thread scope. For example, initializing a local variable with a value in a Fortran 90 subroutine automatically gives the variable the save attribute, and the variable is then shared.2 We recommend explicitly adding the save attribute to the declaration to avoid any issues or confusion.
Why use high-level OpenMP? The central high-level OpenMP strategy is to improve on standard loop-level parallelism by minimizing fork/join overhead and memory latency. Reduction of thread wait times is often seen as another major motivating factor of high-level OpenMP implementations. By explicitly dividing the work among the threads, threads are no longer implicitly waiting on other threads and can therefore go on to the next part of the calculation. This allows explicit control of the synchronization point. In figure 7.8, unlike the typical fork-join model of standard OpenMP, high-level OpenMP keeps the threads dormant but alive, thus reducing overhead tremendously.
Figure 7.8 Visualization of high-level OpenMP threading. Threads are spawned once and left dormant when not needed. Thread bounds are specified manually and synchronization is minimized.
In this section, we’ll review the explicit steps needed to implement high-level OpenMP. Then we’ll show you how to go from a loop-level implementation to a high-level implementation.
Implementation of high-level OpenMP is often more time-consuming because it requires the use of advanced tools and extensive testing. Implementing high-level OpenMP can also be difficult as it is more prone to race conditions than the standard loop-level implementation. Additionally, it is often not apparent how to get from the starting point (loop-level implementation) to the ending point (high-level implementation).
The common use for the more tedious high-level OpenMP implementation is when you want more efficiency and want to eliminate thread spawning and synchronization costs. For more information on high-level OpenMP, see section 7.11. You can implement efficient high-level OpenMP by having a good understanding of the memory bounds of all loops in your application, enabling the use of profiling tools, and methodically working through the following steps. We suggest and show an implementation strategy that is incremental, methodical, and can provide a successful, smooth transition to a high-level OpenMP implementation. Steps to a high-level OpenMP implementation include
Step 1: Reduce thread start-up—Merge the parallel regions and join all the loop-level parallel constructs into larger parallel regions
Step 2: Synchronization—Add nowait clauses to for loops where synchronization is not needed, and calculate and manually partition the loops across the threads, which allows for removal of barriers and required synchronization.
Step 3: Optimize—Make arrays and variables private to each thread when possible.
Step 4: Code correctness—Check thoroughly for race conditions (after every step).
Figures 7.9 and 7.10 show the pseudocode corresponding to the previous four steps, starting with a typical loop-level implementation using omp parallel do pragmas and transitioning to more efficient high-level parallelism.
Figure 7.9 High-level OpenMP starts with a loop-level OpenMP implementation and merges parallel regions together to reduce the cost of thread spawning. We use the animal images to represent where the changes are made and the relative speed of the actual implementation. The conventional loop-level OpenMP shown with the turtle is faster than serial code, but there is overhead with each parallel do that limits speedup. The dog represents the relative gain in speed from merging parallel regions.
Figure 7.10 The next steps for high-level OpenMP add nowait clauses to do or for loops, which reduce synchronization costs. Then we calculate the loop bounds ourselves and explicitly use these in the loops to avoid even more synchronization. Here, the cheetah and the hawk identify the changes made in both implementations. The hawk (on the right) is faster than the cheetah (on the left) as the overhead of the OpenMP is reduced.
In our steps to a high-level OpenMP implementation, the thread start-up time is reduced in the first step of high-level OpenMP. The entire code is placed in a single parallel region in order to minimize the overhead of forking and joining. In high-level OpenMP, threads are generated by the parallel directive once, at the beginning of the execution of the program. Unused threads do not die but remain dormant when running through a serial portion. To guarantee this, the serial portion is executed by the main thread, enabling few to no changes in the serial portion of the code. Once the program finishes running through the serial portion or starts a parallel region again, the same threads forked at the beginning of the program are invoked or reused.
Step 2 addresses the synchronization added to every for loop in OpenMP by default. The easiest way to reduce synchronization cost is to add nowait clauses to all loops where it is possible, while maintaining correctness. A further step is to explicitly divide the work among threads. The typical code for explicitly dividing the work for C is shown here. (The Fortran equivalent accounting for arrays starting at 1 is shown in figure 7.10.)
tbegin = N *  threadID      / nthreads
tend   = N * (threadID + 1) / nthreads
The impact of the manual partitioning of the arrays is that it reduces cache thrashing and race conditions by not allowing threads to share the same space in memory.
Step 3, optimization, means that we explicitly state whether certain variables are shared or private. By giving the threads a specific space in memory, the compiler (and programmer) can forgo guessing about the state of the variables. This can be done by applying the variable scoping rules from figure 7.7. Furthermore, compilers cannot properly parallelize loops that include complex loop-carried dependencies and loops that are not in canonical form. High-level OpenMP helps the compiler by being more explicit about the thread scoping of variables, thus allowing complex loops to be parallelized. This leads into the last part of this step for the high-level OpenMP approach. Arrays will be partitioned across the threads. Explicit partitioning of the arrays guarantees that a thread only touches memory assigned to it and allows us to start fixing memory locality issues.
In the last step, code correctness, it is important to use the tools listed in section 7.9 to detect and fix race conditions. In the next section, we will show you the process of implementing the steps we described. The programs found in the GitHub source for this chapter will prove useful in following along with the stepwise process.
You can complete a full implementation of high-level OpenMP in a series of steps. You should first look at where the bottleneck(s) of the code are in your application, in addition to finding the most compute-intensive loop in the code. You can then find the innermost loop of the code and add the standard loop-based OpenMP directives. The scoping of the variables in the most intensive loops and inner loops needs to be understood; refer to figure 7.7 for guidance.
In step 1, you should focus on reducing the thread start-up costs. This is done in listing 7.15 by merging parallel regions to include the entire iteration loop in a single parallel region. We start slowly moving the OpenMP directives outward, expanding the parallel region. The original OpenMP pragmas on lines 49 and 57 can be merged into one parallel region between lines 44 and 70. The extent of the parallel region is defined by the curly braces on lines 45 and 70, thus starting the parallel region only once instead of 10,000 times.
Listing 7.15 Merging parallel regions into a single parallel region
HighLevelOpenMP_stencil/stencil_opt4.c
44 #pragma omp parallel                    // spawn threads  ❶
45 {
46    int thread_id = omp_get_thread_num();
47    for (int iter = 0; iter < 10000; iter++){
48       if (thread_id == 0) cpu_timer_start(&tstart_flush);
49 #pragma omp for nowait                   ❷
50       for (int l = 1; l < jmax*imax*4; l++){
51          flush[l] = 1.0;
52       }
53       if (thread_id == 0){
54          flush_time += cpu_timer_stop(tstart_flush);
55          cpu_timer_start(&tstart_stencil);
56       }
57 #pragma omp for                          ❸
58       for (int j = 1; j < jmax-1; j++){
59          for (int i = 1; i < imax-1; i++){
60             xnew[j][i]=(x[j][i] + x[j][i-1] + x[j][i+1] + x[j-1][i] + x[j+1][i])/5.0;
61          }
62       }                                  // implied barrier
63       if (thread_id == 0){
64          stencil_time += cpu_timer_stop(tstart_stencil);
65
66          SWAP_PTR(xnew, x, xtmp);
67          if (iter%1000 == 0) printf("Iter %d\n",iter);
68       }
69    }
70 } // end omp parallel                    // implied barrier
❶ Single OpenMP parallel region
❷ OpenMP for pragma with no synchronization barrier at end of loop
❸ OpenMP for pragma with an implied synchronization barrier at the end of the loop
Portions of the code that are required to be run in serial are placed in control of the main thread, allowing for the parallel region to be expanded across large portions of the code that encompass both serial and parallel regions. With each step, use the tools discussed in section 7.9 to make sure that the application still runs correctly.
In the second part of the implementation, you begin the transition to high-level OpenMP by moving the main OpenMP parallel region to the beginning of the program. After that, you can move on to calculating upper and lower loop bounds. Listing 7.16 (and the online examples in stencil_opt5.c and stencil_opt6.c) shows how you calculate the upper and lower bounds specific to the parallel region. Remember, arrays start at different points depending on the language: Fortran starts at 1 and C starts at 0. Loops with the same upper and lower bounds can use the same thread without having to recalculate the bounds.
Note You must be careful to insert barriers in required locations to prevent race conditions. Much care also needs to be taken when placing these pragmas as too many could become detrimental to the overall performance of the application.
Listing 7.16 Precalculating loop lower and upper bounds
HighLevelOpenMP_stencil/stencil_opt6.c
29 #pragma omp parallel                    // spawn threads
30 {
31 int thread_id = omp_get_thread_num();
32 int nthreads = omp_get_num_threads();
33
34 int jltb = 1 + (jmax-2) * ( thread_id ) / nthreads; ❶
35 int jutb = 1 + (jmax-2) * ( thread_id + 1 ) / nthreads; ❶
36
37 int ifltb = (jmax*imax*4) * ( thread_id ) / nthreads; ❶
38 int ifutb = (jmax*imax*4) * ( thread_id + 1 ) / nthreads; ❶
39
40 int jltb0 = jltb; ❶
41 if (thread_id == 0) jltb0--; ❶
42 int jutb0 = jutb; ❶
43 if (thread_id == nthreads-1) jutb0++; ❶
44
45 int kmin = MAX(jmax/2-5,jltb); ❶
46 int kmax = MIN(jmax/2+5,jutb); ❶
47
48 if (thread_id == 0) cpu_timer_start(&tstart_init); ❷
49 for (int j = jltb0; j < jutb0; j++){ ❸
50 for (int i = 0; i < imax; i++){
51 xnew[j][i] = 0.0;
52 x[j][i] = 5.0;
53 }
54 }
55
56 for (int j = kmin; j < kmax; j++){ ❸
57 for (int i = imax/2 - 5; i < imax/2 -1; i++){
58 x[j][i] = 400.0;
59 }
60 }
61 #pragma omp barrier ❹
62 if (thread_id == 0) init_time += cpu_timer_stop(tstart_init);
63
64 for (int iter = 0; iter < 10000; iter++){
65 if (thread_id == 0) cpu_timer_start(&tstart_flush); ❷
66 for (int l = ifltb; l < ifutb; l++){
67 flush[l] = 1.0;
68 }
69 if (thread_id == 0){ ❷
70 flush_time += cpu_timer_stop(tstart_flush); ❷
71 cpu_timer_start(&tstart_stencil); ❷
72 } ❷
73 for (int j = jltb; j < jutb; j++){ ❸
74 for (int i = 1; i < imax-1; i++){
75 xnew[j][i]=( x[j][i] + x[j][i-1] + x[j][i+1] + x[j-1][i] + x[j+1][i] )/5.0;
76 }
77 }
78 #pragma omp barrier ❹
79 if (thread_id == 0){ ❷
80 stencil_time += cpu_timer_stop(tstart_stencil); ❷
81
82 SWAP_PTR(xnew, x, xtmp); ❷
83 if (iter%1000 == 0) printf("Iter %d\n",iter); ❷
84 } ❷
85 #pragma omp barrier ❹
86 }
87 } // end omp parallel
❶ Precalculates each thread's loop bounds from its thread ID
❷ Uses thread ID instead of OpenMP masked pragma to eliminate synchronization
❸ Uses manually calculated loop bounds
❹ Barrier to synchronize with other threads
To obtain a correct answer, it is crucial to start from the innermost loop and understand which variables need to stay private or become shared among the threads. As you start enlarging the parallel region, serial portions of the code are placed into a masked region. This region has one thread that does all the work, while the other threads remain alive but dormant. Zero or only a few changes are required when placing serial portions of the code under the main thread. Once the program finishes running through the serial region or enters a parallel region, the previously dormant threads start working again to parallelize the current loop.
For the final step, comparing results for steps along the way to a high-level OpenMP implementation, in listings 7.15 and 7.16 and the provided online stencil examples, you can see that the number of pragmas is greatly reduced while also yielding better performance (figure 7.11).
Figure 7.11 Optimizing the OpenMP pragmas both reduces the number of pragmas required and improves the performance of the stencil kernel.
In this section, we will combine topics from chapter 6 with what you have learned in this chapter. This combination yields better parallelization and utilizes the vector processor. The OpenMP threaded loop can be combined with the vectorized loop by adding the simd clause to the parallel for pragma, as in #pragma omp parallel for simd. The following listing shows this for the stream triad.
Listing 7.17 Loop-level OpenMP threading and vectorization of the stream triad
StreamTriad/stream_triad_opt3.c
 1 #include <stdio.h>
 2 #include <time.h>
 3 #include <omp.h>
 4 #include "timer.h"
 5
 6 #define NTIMES 16
 7 #define STREAM_ARRAY_SIZE 80000000              ❶
 8 static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];
 9
10 int main(int argc, char *argv[]){
11 #pragma omp parallel                            // spawn threads
12    if (omp_get_thread_num() == 0)
13       printf("Running with %d thread(s)\n",omp_get_num_threads());  // implied barrier
14
15    struct timeval tstart;
16    double scalar = 3.0, time_sum = 0.0;
17 #pragma omp parallel for simd                   // spawn threads
18    for (int i=0; i<STREAM_ARRAY_SIZE; i++) {    ❷
19       a[i] = 1.0;
20       b[i] = 2.0;
21    }                                            // implied barrier
22    for (int k=0; k<NTIMES; k++){
23       cpu_timer_start(&tstart);
24 #pragma omp parallel for simd                   // spawn threads
25       for (int i=0; i<STREAM_ARRAY_SIZE; i++){  ❸
26          c[i] = a[i] + scalar*b[i];
27       }                                         // implied barrier
28       time_sum += cpu_timer_stop(tstart);
29       c[1]=c[2];                                ❹
30    }
31
32    printf("Average runtime is %lf msecs\n", time_sum/NTIMES);
33 }
❶ Large enough to force into main memory
❷ Initializes the arrays in parallel
❸ The stream triad kernel loop
❹ Keeps the compiler from optimizing out the loop
The hybrid implementation of the stencil example with both threading and vectorization puts the for pragma on the outer loop and the simd pragma on the inner loop as the following listing shows. Both the threaded and the vectorized loops work best with loops over large arrays as would usually be the case for the stencil example.
Listing 7.18 Stencil example with both threading and vectorization
HybridOpenMP_stencil/stencil_hybrid.c
26 #pragma omp parallel                    // spawn threads
27 {
28 int thread_id = omp_get_thread_num();
29 if (thread_id == 0) cpu_timer_start(&tstart_init);
30 #pragma omp for
31 for (int j = 0; j < jmax; j++){
32 #ifdef OMP_SIMD
33 #pragma omp simd ❶
34 #endif
35 for (int i = 0; i < imax; i++){
36 xnew[j][i] = 0.0;
37 x[j][i] = 5.0;
38 }
39    }                                    // implied barrier
40
41 #pragma omp for
42 for (int j = jmax/2 - 5; j < jmax/2 + 5; j++){
43 for (int i = imax/2 - 5; i < imax/2 -1; i++){
44 x[j][i] = 400.0;
45 }
46    }                                    // implied barrier
47 if (thread_id == 0) init_time += cpu_timer_stop(tstart_init);
48
49 for (int iter = 0; iter < 10000; iter++){
50 if (thread_id ==0) cpu_timer_start(&tstart_flush);
51 #ifdef OMP_SIMD
52 #pragma omp for simd nowait ❷
53 #else
54 #pragma omp for nowait
55 #endif
56 for (int l = 1; l < jmax*imax*10; l++){
57 flush[l] = 1.0;
58 }
59 if (thread_id == 0){
60 flush_time += cpu_timer_stop(tstart_flush);
61 cpu_timer_start(&tstart_stencil);
62 }
63 #pragma omp for
64 for (int j = 1; j < jmax-1; j++){
65 #ifdef OMP_SIMD
66 #pragma omp simd ❶
67 #endif
68 for (int i = 1; i < imax-1; i++){
69 xnew[j][i]=(x[j][i] + x[j][i-1] + x[j][i+1] +
x[j-1][i] + x[j+1][i])/5.0;
70 }
71    }                                    // implied barrier
72 if (thread_id == 0){
73 stencil_time += cpu_timer_stop(tstart_stencil);
74
75 SWAP_PTR(xnew, x, xtmp);
76 if (iter%1000 == 0) printf("Iter %d\n",iter);
77 }
78 #pragma omp barrier
79 }
80 } // end omp parallel
❶ Adds OpenMP SIMD pragma for inner loops
❷ Adds the simd clause to the for pragma in a single combined directive
For the GCC compiler, the results with and without vectorization show a significant speedup with vectorization:
4 threads, GCC 8.2 compiler, Skylake Gold 6152
Threads only:      Timing init 0.006630 flush 17.110755 stencil 17.374676 total 34.499799
Threads & vectors: Timing init 0.004374 flush 17.498293 stencil 13.943251 total 31.454906
The examples shown so far have been simple loops over a set of data with relatively few complications. In this section, we show you how to handle three advanced examples that require more effort:
Split-direction, two-step stencil—Advanced handling for thread scoping of variables
Prefix scan—Explicitly handling partitioning work among threads
The examples in this section reveal the various ways to handle more difficult situations and give you a deeper understanding of OpenMP.
Here we will look at the potential difficulties that arise when implementing OpenMP for a split-direction, two-step stencil operator, where a separate pass is made for each spatial direction. Stencils are building blocks for numerical scientific applications and are used to calculate dynamic solutions to partial differential equations.
In a two-step stencil, where values are calculated on the faces, data arrays have different data-sharing requirements. Figure 7.12 represents such a stencil with 2D-face data arrays. Furthermore, it is common that one of the dimensions of these 2D arrays needs to be shared among all the threads or processes. The x-face data is simpler to deal with because it is aligned with the thread data decomposition, but we don’t need the full x-face array on every thread. The y-face data has a different problem because the data is across threads, necessitating sharing of the y-face 2D array. High-level OpenMP allows for a quick privatization of the dimension needed. Figure 7.12 shows how certain dimensions of a matrix can be kept either private, shared, or both.
The first touch principle of most kernels (defined in section 7.1.1) says that memory will most likely be local to the thread (except at the edges between threads on page boundaries). We can improve the memory locality by making the array sections completely private to the thread where possible, such as the x-face data. Due to the increasing number of processors, increasing the data locality is essential in minimizing the increasing speed gap between processors and memory. The following listing shows a serial implementation to begin with.
Figure 7.12 The x face of a stencil aligned with the threads needs private storage for each thread. The pointer should be on the stack, and each thread should have a different pointer. The y face needs to share the data, so we define one pointer in the static data region where both threads can access it.
Listing 7.19 Split-direction stencil operator
SplitStencil/SplitStencil.c
58 void SplitStencil(double **a, int imax, int jmax)
59 {
60 double** xface = malloc2D(jmax, imax); ❶
61 double** yface = malloc2D(jmax, imax); ❶
62 for (int j = 1; j < jmax-1; j++){ ❷
63 for (int i = 0; i < imax-1; i++){ ❷
64 xface[j][i] = (a[j][i+1]+a[j][i])/2.0; ❷
65 } ❷
66 } ❷
67 for (int j = 0; j < jmax-1; j++){ ❸
68 for (int i = 1; i < imax-1; i++){ ❸
69 yface[j][i] = (a[j+1][i]+a[j][i])/2.0; ❸
70 } ❸
71 } ❸
72 for (int j = 1; j < jmax-1; j++){ ❹
73 for (int i = 1; i < imax-1; i++){ ❹
74 a[j][i] = (a[j][i]+xface[j][i]+xface[j][i-1]+
75 yface[j][i]+yface[j-1][i])/5.0; ❹
76 } ❹
77 } ❹
78 free(xface);
79 free(yface);
80 }
❶ Calculates values on x and y faces of cells
❷ x-face calculation requires only adjacent cells in the x direction.
❸ y-face calculation requires adjacent cells in the y direction.
❹ Adds in contributions from all the faces of the cell
When using OpenMP with the stencil operator, you must determine whether the memory for each thread needs to be private or shared. In listing 7.18 (previously), the memory for the x-direction can be all private, allowing for faster calculations. In the y-direction (figure 7.12), the stencil requires access to the adjacent thread’s data; therefore, this data must be shared among the threads. This leads us to the implementation shown in the following listing.
Listing 7.20 Split-direction stencil operator with OpenMP
SplitStencil/SplitStencil_opt1.c
86 void SplitStencil(double **a, int imax, int jmax)
87 {
88 int thread_id = omp_get_thread_num();
89 int nthreads = omp_get_num_threads();
90
91 int jltb = 1 + (jmax-2) * ( thread_id ) / nthreads; ❶
92 int jutb = 1 + (jmax-2) * ( thread_id + 1 ) / nthreads; ❶
93
94 int jfltb = jltb; ❷
95 int jfutb = jutb; ❷
96 if (thread_id == 0) jfltb--; ❷
97
98 double** xface = (double **)malloc2D(jutb-jltb, imax-1); ❸
99 static double** yface; ❹
100 if (thread_id == 0) yface = (double **)malloc2D(jmax+2, imax); ❺
101 #pragma omp barrier ❻
102 for (int j = jltb; j < jutb; j++){ ❼
103 for (int i = 0; i < imax-1; i++){ ❼
104 xface[j-jltb][i] = (a[j][i+1]+a[j][i])/2.0; ❼
105 } ❼
106 } ❼
107 for (int j = jfltb; j < jfutb; j++){ ❽
108 for (int i = 1; i < imax-1; i++){ ❽
109 yface[j][i] = (a[j+1][i]+a[j][i])/2.0; ❽
110 } ❽
111 } ❽
112 #pragma omp barrier ❾
113 for (int j = jltb; j < jutb; j++){ ❿
114 for (int i = 1; i < imax-1; i++){ ❿
115 a[j][i] = (a[j][i]+xface[j-jltb][i]+xface[j-jltb][i-1]+ ❿
116 yface[j][i]+yface[j-1][i])/5.0; ❿
117 } ❿
118 } ❿
119 free(xface); ⓫
120 #pragma omp barrier ⓬
121 if (thread_id == 0) free(yface); ⓭
122 }
❶ Manually calculates distribution of data across threads
❷ The y faces have one less data value to distribute.
❸ Allocates a private portion of the x-face data for each thread
❹ Declares the y-face data pointer as static so it has shared scope.
❺ Allocates one version of the y-face array to be shared across threads
❻ Inserts an OpenMP barrier so that all threads have the allocated memory
❼ Does the local x-face calculation on each thread
❽ The y-face calculation has a j+1 and thus needs a shared array.
❾ We need OpenMP synchronization because the next loop uses work from adjacent threads.
❿ Combines the work from the previous x-face and y-face loops into a new cell value
⓫ Frees local x-face array for each thread
⓬ A barrier ensures all threads are done with the shared y-face array.
⓭ Frees the y-face array on only one processor
To define the memory on the stack as shown in the x-direction, we need a pointer to a pointer to a double (double **xface) so that the pointer is on the stack and private to each thread. Then we allocate the memory using a custom 2D malloc call at line 98 in listing 7.20. We only need enough memory for each thread, so we compute the thread bounds in lines 91 and 92 and use these in the 2D malloc call. The memory is allocated from the heap and can be shared, but each thread only has its own pointer; therefore, each thread can’t access the other threads’ memory.
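The custom malloc2D routine called here is not shown in this excerpt. A minimal sketch of a contiguous 2D allocator with the same calling convention (the book's actual routine may differ in its details) might look like this:

```c
#include <stdlib.h>

// Allocates a jmax x imax block of doubles plus its row-pointer array in a
// single malloc, so the 2D array is contiguous and freed with one free().
double **malloc2D(int jmax, int imax)
{
   double **x = (double **)malloc(jmax*sizeof(double *) +
                                  jmax*imax*sizeof(double));
   if (x == NULL) return NULL;
   x[0] = (double *)(x + jmax);   // data region starts after the row pointers
   for (int j = 1; j < jmax; j++){
      x[j] = x[j-1] + imax;       // each row pointer steps one row forward
   }
   return x;
}
```

Because the rows and the pointer array live in one allocation, a single free(xface) releases everything, which is why the listings free each 2D array with one call.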
Rather than allocating memory from the heap, we could have used the automatic allocation, such as double xface[3][6], where the memory is automatically allocated on the stack. The compiler automatically sees this declaration and pushes the memory space onto the stack. In cases where the arrays are large, the compiler might move the memory requirement to the heap. Each compiler has a different threshold on deciding whether to place memory on the heap or on the stack. If the compiler moves the memory to the heap, only one thread has the pointer to this location. In effect, it is private, even though it is in shared memory space.
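For illustration, an automatic (stack) array inside a function is private to whichever thread executes it; no explicit OpenMP scoping clauses are needed. A minimal sketch (the function name and sizes here are made up for illustration):

```c
// Each call gets its own automatic xface array on the stack. Inside an
// OpenMP parallel region, every thread calling this function therefore
// gets a private copy of the array.
double average_faces(int jmax, int imax)
{
   double xface[jmax][imax];   // C99 variable-length array on the stack
   for (int j = 0; j < jmax; j++){
      for (int i = 0; i < imax; i++){
         xface[j][i] = (double)(j + i);
      }
   }
   double sum = 0.0;
   for (int j = 0; j < jmax; j++){
      for (int i = 0; i < imax; i++){
         sum += xface[j][i];
      }
   }
   return sum/(jmax*imax);
}
```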
For the y-faces, we define a static pointer to a pointer (static double **yface), where all threads can access the same pointer. In this case, only one thread needs to do this memory allocation, and all remaining threads can access this pointer and the memory itself. For this example, you can use figure 7.7 to see the different options for making the memory shared. In this case, you would go to the Parallel Region -> C Routine and pick one of the file-scope variables, extern or static, to make the pointer shared among the threads. It is easy to get something wrong, such as the variable scope, the memory allocation, or the synchronization. For example, what happens if we just define a regular double **yface pointer? Then each thread has its own private pointer, and only one of these gets memory allocated. The pointers for the other threads would not point to anything, generating an error when used.
Figure 7.13 shows the performance for running the threaded version of the code on the Skylake Gold processor. For small thread counts, we get a super-linear speedup, which falls off above eight threads. Super-linear speedup happens on occasion because cache performance improves as the data is partitioned across threads or processors.
Figure 7.13 The threaded version of the split stencil has a super-linear speedup for two to eight threads.
Definition Super-linear speedup is performance that’s better than the ideal scaling curve for strong scaling. This can happen because the smaller array sizes fit into a higher level of the cache, resulting in better cache performance.
For the enhanced-precision Kahan summation algorithm, introduced in section 5.7, we cannot use a pragma to get the compiler to generate a multi-threaded implementation because of the loop-carried dependencies. Therefore, we’ll follow a similar algorithm as we used in the vectorized implementation in section 6.3.4. We first sum up the values on each thread in the first phase of the calculation. Then we sum the values across the threads to get the final sum as the following listing shows.
Listing 7.21 An OpenMP implementation of the Kahan summation
GlobalSums/kahan_sum.c
1 #include <stdlib.h>
2 #include <omp.h>
3
4 double do_kahan_sum(double* restrict var, long ncells)
5 {
6 struct esum_type{
7 double sum;
8 double correction;
9 };
10
11 int nthreads = 1; ❶
12 int thread_id = 0; ❶
13 #ifdef _OPENMP
14 nthreads = omp_get_num_threads();
15 thread_id = omp_get_thread_num();
16 #endif
17
18 struct esum_type local;
19 local.sum = 0.0;
20 local.correction = 0.0;
21
22 int tbegin = ncells * ( thread_id ) / nthreads; ❷
23 int tend = ncells * ( thread_id + 1 ) / nthreads; ❷
24
25 for (long i = tbegin; i < tend; i++) {
26 double corrected_next_term = var[i] + local.correction;
27 double new_sum = local.sum + local.correction;
28 local.correction = corrected_next_term - (new_sum - local.sum);
29 local.sum = new_sum;
30 }
31
32 static struct esum_type *thread; ❸
33 static double sum; ❸
34
35 #ifdef _OPENMP ❹
36 #pragma omp masked
37 thread = malloc(nthreads*sizeof(struct esum_type)); ❺
38 #pragma omp barrier
39
40 thread[thread_id].sum = local.sum; ❻
41 thread[thread_id].correction = local.correction; ❻
42
43 #pragma omp barrier ❼
44
45 static struct esum_type global;
46 #pragma omp masked ❽
47 {
48 global.sum = 0.0;
49 global.correction = 0.0;
50 for ( int i = 0 ; i < nthreads ; i ++ ) {
51 double corrected_next_term = thread[i].sum +
52 thread[i].correction + global.correction;
53 double new_sum = global.sum + global.correction;
54 global.correction = corrected_next_term -
(new_sum - global.sum);
55 global.sum = new_sum;
56 }
57
58 sum = global.sum + global.correction;
59 free(thread);
60 } // end omp masked
61 #pragma omp barrier
62 #else
63 sum = local.sum + local.correction;
64 #endif
65
66 return(sum);
67 }
❶ Gets the total number of threads and thread_id
❷ Computes the range for which this thread is responsible
❸ Puts the variables in shared memory
❹ The compiler defines the variable _OPENMP when compiling with OpenMP
❺ Allocates, on one thread, a shared array with an entry for each thread
❻ Stores the summation of each thread in array
❼ Waits until all threads get here and then sums across threads
❽ Uses a single thread to sum the contributions from all the threads
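The first phase of listing 7.21 is just the serial Kahan loop from section 5.7 applied to each thread's range. Extracted as a standalone serial routine (a sketch; the function name is ours, but the loop body mirrors the listing), it looks like this:

```c
// Serial version of the Kahan-style loop used per thread in listing 7.21.
// The correction term carries the incoming value plus any rounding residue
// the running sum could not yet absorb; adding it at the end completes the sum.
double kahan_sum_serial(const double* restrict var, long ncells)
{
   double sum = 0.0, correction = 0.0;
   for (long i = 0; i < ncells; i++){
      double corrected_next_term = var[i] + correction;
      double new_sum = sum + correction;
      correction = corrected_next_term - (new_sum - sum);
      sum = new_sum;
   }
   return sum + correction;
}
```

The threaded version runs this loop on each thread's portion of the array and then applies the same update once more, on a single thread, across the per-thread partial sums and corrections.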
In this section, we look at the threaded implementation of the prefix scan operation. The prefix scan operation, introduced in section 5.6, is important for algorithms with irregular data. This is because a count to determine the starting location for ranks or threads allows the rest of the calculation to be done in parallel. As discussed in that section, the prefix scan can also be done in parallel, yielding another parallelization benefit. The implementation process has three phases:
All threads—Calculates a prefix scan for each thread’s portion of the data
Single thread—Calculates the starting offset for each thread’s data
All threads—Applies the new thread offset across all the data for each thread
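These three phases can be exercised serially over simulated thread chunks. The following sketch (our own illustrative function, not the book's code) uses the same chunk bounds and exclusive-scan logic as listing 7.22:

```c
// Serial sketch of the three-phase exclusive prefix scan: phase 1 scans
// each chunk independently, phase 2 computes each chunk's starting offset
// on a single "thread," and phase 3 adds the offsets back in.
void prefix_scan_phases(const int *input, int *output, int length, int nchunks)
{
   // Phase 1: exclusive scan within each chunk (done by all threads)
   for (int t = 0; t < nchunks; t++){
      int begin = length * t / nchunks;
      int end   = length * (t + 1) / nchunks;
      if (begin < end){
         output[begin] = 0;
         for (int i = begin + 1; i < end; i++){
            output[i] = output[i-1] + input[i-1];
         }
      }
   }
   // Phase 2: each chunk's offset is the running total of all earlier
   // chunks (done by a single thread)
   int offsets[nchunks];
   offsets[0] = 0;
   for (int t = 1; t < nchunks; t++){
      int prev_end = length * t / nchunks;
      offsets[t] = offsets[t-1] + output[prev_end-1] + input[prev_end-1];
   }
   // Phase 3: apply each chunk's offset to its entries (done by all threads)
   for (int t = 1; t < nchunks; t++){
      int begin = length * t / nchunks;
      int end   = length * (t + 1) / nchunks;
      for (int i = begin; i < end; i++){
         output[i] += offsets[t];
      }
   }
}
```

Note that the result is independent of the number of chunks, which is exactly the property that lets the OpenMP version return the same answer for any thread count.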
The implementation in listing 7.22 works for a serial application and when called from within an OpenMP parallel region. This has the benefit that you can use the code in the listing for both serial and threaded cases, reducing the code duplication for this operation.
Listing 7.22 An OpenMP implementation of the prefix scan
PrefixScan/PrefixScan.c
1 void PrefixScan (int *input, int *output, int length)
2 {
3 int nthreads = 1; ❶
4 int thread_id = 0; ❶
5 #ifdef _OPENMP
6 nthreads = omp_get_num_threads(); ❶
7 thread_id = omp_get_thread_num(); ❶
8 #endif
9
10 int tbegin = length * ( thread_id ) / nthreads; ❷
11 int tend = length * ( thread_id + 1 ) / nthreads; ❷
12
13 if ( tbegin < tend ) { ❸
14 output[tbegin] = 0; ❹
15 for ( int i = tbegin + 1 ; i < tend ; i++ ) { ❹
16 output[i] = output[i-1] + input[i-1]; ❹
17 }
18 }
19 if (nthreads == 1) return; ❺
20
21 #ifdef _OPENMP
22 #pragma omp barrier ❻
23
24 if (thread_id == 0) { ❼
25 for ( int i = 1 ; i < nthreads ; i ++ ) {
26 int ibegin = length * ( i - 1 ) / nthreads;
27 int iend = length * ( i ) / nthreads;
28
29 if ( ibegin < iend )
30 output[iend] = output[ibegin] + input[iend-1];
31
32 if ( ibegin < iend - 1 )
33 output[iend] += output[iend-1];
34 }
35 }
36 #pragma omp barrier ❽
37
38 #pragma omp simd ❾
39 for ( int i = tbegin + 1 ; i < tend ; i++ ) { ❿
40 output[i] += output[tbegin]; ❿
41 } ❿
42 #endif
43 }
❶ Gets the total number of threads and thread_id
❷ Computes the range for which this thread is responsible
❸ Only performs this operation if there is a positive number of entries.
❹ Does an exclusive scan for each thread
❺ With a single thread, the scan is complete; only multiple threads need the adjustment for each thread's beginning value
❻ Waits until all threads get here
❼ Uses the main thread to compute the beginning offset for each thread
❽ Ends calculation on main thread with barrier
❾ Vectorizes the offset loop with the OpenMP SIMD pragma
❿ Applies the offset to the range for this thread
This algorithm should theoretically scale as
parallel_time = 2 * serial_time/nthreads
The performance on the Skylake Gold 6152 architecture peaks at about 44 threads, 9.4 times faster than the serial version.
Developing a robust OpenMP implementation is difficult without using specialized tools for detecting thread race conditions and performance bottlenecks. The use of tools becomes much more important as you try to get a higher performance OpenMP implementation. There are both commercial and openly available tools. The typical tool list when integrating advanced implementations of OpenMP in your application includes:
Valgrind—A memory tool introduced in section 2.1.3. It also works with OpenMP and helps in finding uninitialized memory or out-of-bounds accesses in threads.
Call graph—The cachegrind tool produces a call graph and a profile of your application. A call graph determines which functions call other functions to clearly show the call hierarchy and code path. An example of the cachegrind tool was presented in section 3.3.1.
Allinea/ARM Map—A high-level profiler to get an overall cost of thread starts and barriers (for OpenMP apps).
Intel® Inspector—To detect thread race conditions (for OpenMP apps).
We described the first two tools in earlier chapters; refer to those sections for details. In this section, we discuss the last two tools because they relate more closely to an OpenMP application. These tools are needed to profile the bottlenecks and understand where they lie within your application and are, therefore, essential in knowing where to best start changing your code efficiently.
One of the better tools for getting a high-level application profile is Allinea/ARM MAP. Figure 7.14 shows a simplified view of its interface. For an OpenMP application, it shows the cost of thread starts and waits, highlights the application's bottlenecks, and shows memory, CPU, and floating-point utilization. The profiler makes it easy to compare the gains made before and after code changes. Allinea/ARM MAP excels at producing a quick, high-level view of your application, but there are many other profilers that can be used. Some of these are reviewed in section 17.3.
Figure 7.14 These are results from Allinea/ARM MAP showing the majority of the compute time on the highlighted line of code. We often use indicators like this to show us the location of bottlenecks.
It is essential to find and eliminate thread race conditions in an OpenMP implementation to produce a robust, production-quality application. For this purpose, tools are essential because it is impossible for even the best programmer to catch all the thread race conditions. As the application begins to scale, memory errors occur more frequently and can cause an application to break. Catching these memory errors early on saves time and energy on future runs.
There are not many tools that are effective at finding thread race conditions. We show the use of one of these tools, the Intel® Inspector, to detect and pinpoint the location of these race conditions. Having tools to understand thread race conditions in memory is also useful when scaling to larger thread counts. Figure 7.15 provides a sample screenshot of Intel® Inspector.
Figure 7.15 Intel® Inspector report showing detection of thread race conditions. Here the items listed as Data race under the Type heading on the panel to the upper left show all the places where there is currently a race condition.
Before changes in the initial application are made, it is critical to complete regression testing. Ensuring correctness is crucial to the successful implementation of OpenMP threading. A correct OpenMP code cannot be implemented unless an application or a whole subroutine is in its proper working state. This also requires that the section of code that is being threaded with OpenMP must also be exercised in a regression test. Without being able to do regression testing, it becomes difficult to make steady progress. In summary, these tools, along with regression testing, create a better understanding of the dependencies, efficiency, and correctness in most applications.
The task-based parallel strategy was first introduced in chapter 1 and illustrated in figure 1.25. Using a task-based approach, you can divide work into separate tasks that can then be parceled out to individual processes. Many algorithms are more naturally expressed in terms of a task-based approach. OpenMP has supported this type of approach since its version 3.0. In the subsequent standard releases, there have been further improvements to the task-based model. In this section we’ll show you a simple task-based algorithm to illustrate the techniques in OpenMP.
One of the approaches to a reproducible global sum is to sum up the values in a pairwise manner. The normal array approach requires the allocation of a working array and some complicated indexing logic. Using a task-based approach as in figure 7.16 avoids the need for a working array by recursively splitting the data in half in the downward sweep, until an array length of 1 is reached, and then summing up the pairs in the upward sweep.
Figure 7.16 The task-based implementation recursively splits the array into half on the downward sweep. Once an array size of 1 occurs, the task sums pairs of data in the upward sweep.
Listing 7.23 shows the code for the task-based approach. The spawning of the task needs to be done in a parallel region but by only one thread, leading to the nested blocks of pragmas in lines 8 to 14.
Listing 7.23 A pair-wise summation using OpenMP tasks
PairwiseSumByTask/PairwiseSumByTask.c
1 #include <omp.h>
2
3 double PairwiseSumBySubtask(double* restrict var, long nstart, long nend);
4
5 double PairwiseSumByTask(double* restrict var, long ncells)
6 {
7 double sum;
8 #pragma omp parallel ❶
9 {
10 #pragma omp masked ❷
11 {
12 sum = PairwiseSumBySubtask(var, 0, ncells); ❷
13 }
14 }
15 return(sum);
16 }
17
18 double PairwiseSumBySubtask(double* restrict var, long nstart, long nend)
19 {
20 long nsize = nend - nstart;
21 long nmid = nsize/2; ❸
22 double x,y;
23 if (nsize == 1){ ❹
24 return(var[nstart]); ❹
25 }
26
27 #pragma omp task shared(x) mergeable final(nsize > 10) ❺
28 x = PairwiseSumBySubtask(var, nstart, nstart + nmid); ❺
29 #pragma omp task shared(y) mergeable final(nsize > 10) ❺
30 y = PairwiseSumBySubtask(var, nend - nmid, nend); ❺
31 #pragma omp taskwait ❻
32
33 return(x+y); ❼
34 }
❶ Spawns the threads
❷ Starts main task on one thread
❸ Subdivides the array into two parts
❹ Initializes sum at leaf with single value from array
❺ Launches a pair of subtasks with half of the data for each
❻ Waits for two tasks to complete
❼ Sums the values from the two subtasks and returns to the calling thread
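With the OpenMP pragmas ignored, the task version reduces to a plain recursive pairwise sum that can be exercised serially. A standalone sketch (our own; it splits at nstart + nmid so that odd-length ranges are also covered):

```c
// Recursive pairwise sum: split the range in half until a single element
// remains, then add the pairs back up the call tree (figure 7.16).
double pairwise_sum(const double* restrict var, long nstart, long nend)
{
   long nsize = nend - nstart;
   if (nsize == 1) return var[nstart];
   long nmid = nsize/2;
   return pairwise_sum(var, nstart, nstart + nmid) +
          pairwise_sum(var, nstart + nmid, nend);
}
```

Because the pairing pattern is fixed by the recursion rather than by the thread count or scheduling, the result is bitwise reproducible from run to run.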
Getting good performance with a task-based algorithm takes a lot more tuning to prevent too many threads from being spawned and to keep granularity of the tasks reasonable. For some algorithms, task-based algorithms are a much more appropriate parallel strategy.
There are many materials on traditional thread-based OpenMP programming. With nearly every compiler supporting OpenMP, the best learning approach is to simply start adding OpenMP directives to your code. There are many training opportunities covering OpenMP, including the annual Supercomputing Conference held in November. For information, see https://sc21.supercomputing.org/. For those who are even more interested in OpenMP, there is an International Workshop on OpenMP held every year that covers the latest developments. For information, see http://www.iwomp.org/.
Barbara Chapman is one of the leading writers and authorities on OpenMP. Her book is the standard reference for OpenMP programming, especially for the threading implementation in OpenMP as of 2008:
Barbara Chapman, Gabriele Jost, and Ruud Van Der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, vol. 10 (MIT Press, 2008).
There are many researchers working on developing more efficient techniques of implementing OpenMP, which has come to be called high-level OpenMP. Here is a link to slides going into more detail on high-level OpenMP:
Yuliana Zamora, “Effective OpenMP Implementations on Intel’s Knights Landing,” Los Alamos National Laboratory Technical Report LA-UR-16-26774, 2016. Available at: https://www.osti.gov/biblio/1565920-effective-openmp-implementations-intel-knights-landing.
A good textbook on OpenMP and MPI is one written by Peter Pacheco. It has some good examples of OpenMP code:
Peter Pacheco, An Introduction to Parallel Programming (Elsevier, 2011).
Blaise Barney at Lawrence Livermore National Laboratory has authored a well-written OpenMP reference that’s also available online:
Blaise Barney, OpenMP Tutorial, https://computing.llnl.gov/tutorials/openMP/
The OpenMP Architecture Review Board (ARB) maintains a website that is the authoritative location for all things OpenMP, from specifications to presentations and tutorials:
OpenMP Architecture Review Board, OpenMP, https://www.openmp.org.
For a deeper discussion on the difficulties with threading:
Edward A. Lee, “The Problem with Threads,” Computer 39, no. 5 (2006): 33-42.
Convert the vector add example in listing 7.8 into a high-level OpenMP implementation following the steps in section 7.2.2.
Write a routine to get the maximum value in an array. Add an OpenMP pragma to add thread parallelism to the routine.
Write a high-level OpenMP version of the reduction in the previous exercise.
We covered a substantial amount of material in this chapter. This solid foundation will help you in developing an effective OpenMP application.
Loop-level implementations of OpenMP can be quick and easy to create.
An efficient implementation of OpenMP can achieve promising application speed-up.
Good first-touch implementations can often gain a 10-20% performance improvement.
Understanding variable scope across threads is important in getting OpenMP code to work.
High-level OpenMP can boost performance on current and upcoming many-core architectures.
Threading and debugging tools are essential when implementing more complex versions of OpenMP.
Some of the style guidelines that are suggested in this chapter include
Declaring variables where these are used so that they automatically become private, which is generally correct.
Modifying declarations to get the right threading scope for variables rather than using an extensive list in private and shared clauses.
Avoiding the critical clause or other locking constructs where possible. Performance is generally impacted heavily by these constructs.
Reducing synchronization by adding nowait clauses to for loops and limiting the use of #pragma omp barrier to only where necessary.
Merging small parallel regions into fewer, larger parallel regions to reduce OpenMP overhead.
1. #pragma omp masked was #pragma omp master. With the release of OpenMP standard v 5.1 in Nov. 2020, the term “master” was changed to “masked” to address concerns that it is offensive to many in the technical community. We are strong advocates of inclusion and, thus, use the new syntax throughout this chapter. Readers are warned that compilers may take some time to implement the change. Note that the examples that accompany the chapter will use the older syntax until most compilers are updated.
2. This is not the case under the Fortran 77 standard! But even with Fortran 77, some compilers such as the DEC Fortran compiler mandate that every variable in a routine have the save attribute, causing obscure bugs and portability problems. Knowing this, we could make sure we are compiling with the Fortran 90 standard and potentially fix the private scoping issue by initializing the array pointer, which causes it to be moved to the heap, making the variable shared.
The importance of the Message Passing Interface (MPI) standard is that it allows a program to access additional compute nodes and, thus, run larger and larger problems by adding more nodes to the simulation. The name message passing refers to the ability to easily send messages from one process to another. MPI is ubiquitous in the field of high-performance computing. Across many scientific fields, the use of supercomputers entails an MPI implementation.
MPI was launched as an open standard in 1994 and, within months, became the dominant parallel computing library-based language. Since 1994, the use of MPI has led to scientific breakthroughs from physics to machine learning to self-driving cars! Several implementations of MPI are now in widespread use. MPICH from Argonne National Laboratory and OpenMPI are two of the most common. Hardware vendors often have customized versions of one of these two implementations for their platforms. The MPI standard, now up to version 3.1 as of 2015, continues to evolve and change.
In this chapter, we’ll show you how to implement MPI in your application. We’ll start with a simple MPI program and then progress to a more complicated example of how to link together separate computational meshes on separate processes through communicating boundary information. We’ll touch on some advanced techniques that are important for well-written MPI programs, such as building custom MPI data types and the use of MPI Cartesian topology functions. Last, we’ll introduce combining MPI with OpenMP (MPI plus OpenMP) and vectorization to get multiple levels of parallelism.
Note We encourage you to follow along with the examples for this chapter at https://github.com/EssentialsofParallelComputing/Chapter8.
In this section, we will cover the basics that are needed for a minimal MPI program. Some of these basic requirements are specified by the MPI standard, while others are provided by convention by most MPI implementations. The basic structure and operation of MPI has stayed remarkably consistent since the first standard.
To begin, MPI is a completely library-based language. It does not require a special compiler or accommodations from the operating system. All MPI programs have a basic structure and process as figure 8.1 shows. MPI always begins with an MPI_Init call right at the start of the program and an MPI_Finalize at the program’s exit. This is in contrast to OpenMP, as discussed in chapter 7, which needs no special startup and shutdown commands and just places parallel directives around key loops.
Figure 8.1 The MPI approach is library-based. Just compile, linking in the MPI library, and launch with a special parallel startup program.
Once you write an MPI parallel program, it is compiled with an include file and library. Then it is executed with a special startup program that establishes the parallel processes across nodes and within the node.
The basic MPI function calls include MPI_Init and MPI_Finalize. The call to MPI_Init should be right after program startup, and the arguments from the main routine must be passed to the initialization call. Typical calls look like the following and may or may not use the return variable:
iret = MPI_Init(&argc, &argv);
iret = MPI_Finalize();
Most programs will need the number of processes and the process rank within the group that can communicate, called a communicator. One of the main functions of MPI is to start up remote processes and lash these up so messages can be sent between the processes. The default communicator is MPI_COMM_WORLD, which is set up at the beginning of every parallel job by MPI_Init. Let’s take a moment to look at a few definitions:
Process—An independent unit of computation that has ownership of a portion of memory and control over resources in user space.
Rank—A unique, portable identifier to distinguish the individual process within the set of processes. Normally this would be an integer within the set of integers from zero to one less than the number of processes.
The calls to get these important variables are
iret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
iret = MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
Although MPI is a library, we can treat it like a compiler through the use of the MPI compiler wrappers. This makes the building of MPI applications easier because you don’t need to know which libraries are required and where the libraries are located. These are especially convenient for small MPI applications. There are compiler wrappers for each programming language:
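The conventional wrapper names, as supplied by both MPICH and OpenMPI, are

mpicc—wrapper for the C compiler
mpicxx or mpic++—wrapper for the C++ compiler
mpifort (or the older mpif90)—wrapper for the Fortran compiler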
Using these wrappers is optional. If you are not using the compiler wrappers, they can still be valuable for identifying the compile flags necessary for building your application. The mpicc command has options that output this information. You can find these options for your MPI with man mpicc. For the two most popular MPI implementations, we list the command-line options for mpicc, mpicxx, and mpifort here.
The startup of the parallel processes for MPI is a complex operation that is handled by a special command. At first, this command was often mpirun. But with the release of the MPI 2.0 standard in 1997, the startup command was recommended to be mpiexec, to try and provide more portability. Yet this attempt at standardization was not completely successful, and today there are several names used for the startup command:
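Common names for the startup command include

mpirun—the traditional launcher name, still provided by most MPI implementations
mpiexec—the name recommended by the MPI standard
srun—the launcher used under the Slurm job scheduler
aprun—the launcher on some Cray systems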
Most MPI startup commands take the option -n for the number of processes, but others might take -np. With the complexity of recent computer node architectures, the startup commands have a myriad of options for affinity, placement, and environment (some of which we will discuss in chapter 14). These options vary with each MPI implementation and even with each release of their MPI libraries. The simplicity of the options available from the original startup commands has morphed into a confusing morass of options that have not yet stabilized. Fortunately, for the beginning MPI user, most of these options can be ignored, but they are important for advanced use and tuning.
Now that we have learned all the basic components, we can combine them into the minimum working example that listing 8.1 shows: we start the parallel job and print out the rank and number of processes from each process. In the call to get the rank and size, we use the MPI_COMM_WORLD variable that is the group of all the MPI processes and is predefined in the MPI header file. Note that the displayed output can be in any order; the MPI program leaves it up to the operating system for when and how the output is displayed.
Listing 8.1 MPI minimum working example
MinWorkExampleMPI.c
 1 #include <mpi.h>                             ❶
 2 #include <stdio.h>
 3 int main(int argc, char **argv)
 4 {
 5    MPI_Init(&argc, &argv);                   ❷
 6
 7    int rank, nprocs;
 8    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     ❸
 9    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   ❹
10
11    printf("Rank %d of %d\n", rank, nprocs);
12
13    MPI_Finalize();                           ❺
14    return 0;
15 }
❶ Include file for MPI functions and variables
❷ Initializes after program start, including program arguments
❸ Gets the rank number of the process
❹ Gets the number of ranks in the program determined by the mpirun command
❺ Finalizes MPI to synchronize ranks and then exits
Listing 8.2 defines a simple makefile to build this example using the MPI compiler wrappers. In this case, we use the mpicc wrapper to supply the location of the mpi.h include file and the MPI library.
Listing 8.2 Simple makefile using MPI compiler wrappers
MinWorkExample/Makefile.simple
default: MinWorkExampleMPI
all: MinWorkExampleMPI
MinWorkExampleMPI: MinWorkExampleMPI.c Makefile
mpicc MinWorkExampleMPI.c -o MinWorkExampleMPI
clean:
rm -f MinWorkExampleMPI MinWorkExampleMPI.o
For more elaborate builds on a variety of systems, you might prefer CMake. The following listing shows the CMakeLists.txt file for this program.
Listing 8.3 The CMakeLists.txt for building with CMake
MinWorkExample/CMakeLists.txt
cmake_minimum_required(VERSION 2.8)
project(MinWorkExampleMPI)

# Require MPI for this project:
find_package(MPI REQUIRED)                        ❶

add_executable(MinWorkExampleMPI MinWorkExampleMPI.c)
target_include_directories(MinWorkExampleMPI      ❷
   PRIVATE ${MPI_C_INCLUDE_PATH})
target_compile_options(MinWorkExampleMPI
   PRIVATE ${MPI_C_COMPILE_FLAGS})
target_link_libraries(MinWorkExampleMPI
   ${MPI_C_LIBRARIES} ${MPI_C_LINK_FLAGS})

# Add a test:
enable_testing()
add_test(MPITest ${MPIEXEC} ${MPIEXEC_NUMPROC_FLAG}   ❸
   ${MPIEXEC_MAX_NUMPROCS}
   ${MPIEXEC_PREFLAGS}
   ${CMAKE_CURRENT_BINARY_DIR}/MinWorkExampleMPI
   ${MPIEXEC_POSTFLAGS})

# Cleanup
add_custom_target(distclean COMMAND rm -rf CMakeCache.txt CMakeFiles
   Makefile cmake_install.cmake CTestTestfile.cmake Testing)
❶ Calls a special module to find MPI and sets variables
Now using the CMake build system, let’s configure, build, and then run the test with these commands:
cmake .
make
make test
The write operation from the printf command displays output in any order. Finally, to clean up after the run, use these commands:
make clean
make distclean
The core of the message-passing approach is to send a message from point-to-point or, perhaps more precisely, process-to-process. The whole point of parallel processing is to coordinate work. To do this, you need to send messages either for control or work distribution. We’ll show you how these messages are composed and properly sent. There are many variants of the point-to-point routines; we’ll cover those that are recommended to use in most situations.
Figure 8.2 shows the components of a message. There must be a mailbox at either end of the system. The size of the mailbox is important. The sending side knows the size of the message, but the receiving side does not. To make sure there is a place for the message to be stored, it is usually better to post the receive first. This avoids delaying the message by the receiving process having to allocate a temporary space to store the message until a receive is posted and it can copy it to the right location. For an analogy, if the receive (mailbox) is not posted (not there), the postman has to hang out until someone puts one up. Posting the receive first avoids the possibility of insufficient memory space on the receiving end to allocate a temporary buffer to store the message.
Figure 8.2 A message in MPI is always composed of a pointer to memory, a count, and a type. The envelope has an address composed of a rank, a tag, and a communication group along with an internal MPI context.
The message itself is always composed of a triplet at both ends: a pointer to a memory buffer, a count, and a type. The type sent and type received can be different types and counts. The rationale for using types and counts is that it allows the conversion of types between the processes at the source and at the destination. This permits a message to be converted to a different form at the receiving end. In a heterogeneous environment, this might mean converting little-endian to big-endian, a low-level difference in the byte order of data stored on hardware from different vendors. Also, the receive size can be greater than the amount sent. This permits the receiver to query how much data is sent so it can properly handle the message. But the receiving size cannot be smaller than the sending size because it would cause a write past the end of the buffer.
The envelope also is composed of a triplet. It defines who the message is from, who it is sent to, and a message identifier to keep from getting multiple messages confused. The triplet consists of the rank, tag, and communication group. The rank is for the specified communication group. The tag helps the programmer and MPI distinguish which message goes to which receive. In MPI, the tag is a convenience. It can be set to MPI_ANY_TAG if an explicit tag number is not desired. MPI uses a context created internally within the library to separate the messages correctly. Both the communicator and the tag must match for a message to complete.
Note One of the strengths of the message-passing approach is the memory model. Each process has clear ownership of its data plus the control and synchronization over when the data changes. You can be guaranteed that some other process cannot change your memory while your back is turned.
Now let’s try an MPI program with a simple send/receive. We have to send data on one process and receive data on another. There are different ways that we could issue these calls on a couple of processes (figure 8.3). Some of the combinations of basic blocking send and receives are not safe and can hang, such as the two combinations on the left of figure 8.3. The third combination requires careful programming with conditionals. The method to the far right is one of several safe methods to schedule communications by using non-blocking sends and receives. These are also called asynchronous or immediate calls, which explains the I character preceding the send and receive keywords (the case shown on the far right of the figure).
Figure 8.3 The ordering of blocking send and receives is tricky to do correctly. It is much safer and faster to use the non-blocking or immediate forms of the send and receive operations and then wait for completion.
The most basic MPI send and receive is MPI_Send and MPI_Recv. The basic send and receive functions have the following prototypes:
MPI_Send(void *data, int count, MPI_Datatype datatype, int dest, int tag,
         MPI_Comm comm)
MPI_Recv(void *data, int count, MPI_Datatype datatype, int source, int tag,
         MPI_Comm comm, MPI_Status *status)
Now let’s go through each of the four cases in figure 8.3 to understand why some hang and some work fine. We’ll begin with the MPI_Send and MPI_Recv that were shown in the previous function prototypes and in the left-most example in the figure. Both of these routines are blocking. Blocking means that these do not return until a specific condition is fulfilled. In the case of these two calls, the condition for return is that the buffer is safe to use again. On the send, the buffer must have been read and is no longer needed. On the receive, the buffer must be filled. If both processes in a communication are blocking, a situation known as a hang can occur. A hang occurs when one or more processes are waiting on an event that can never occur.
Let’s try reversing the order of the sends and receives. We list the changed lines in the following listing from the original listing in the previous example.
Listing 8.4 A simple send/receive example in MPI (sometimes fails)
Send_Recv/SendRecv2.c
28    MPI_Send(xsend, count, MPI_DOUBLE,       ❶
         partner_rank, tag, comm);
29    MPI_Recv(xrecv, count, MPI_DOUBLE,       ❷
         partner_rank, tag, comm,
         MPI_STATUS_IGNORE);
❷ Then calls receive operation after send completes
So does this one fail? Well, it depends. The send call returns after the use of the send data buffer is complete. Most MPI implementations will copy the data into preallocated buffers on the sender or receiver if the size is small enough. In this case, the send completes and the receive is called. If the message is large, the send waits for the receive call to allocate a buffer to put the message into before returning. But the receive never gets called, so the program hangs. We could alternate the posting of sends and receives by ranks so that hangs do not occur. We have to use a conditional for this variant as the following listing shows.
Listing 8.5 Send/receive with alternating sends and receives by rank
Send_Recv/SendRecv3.c
28 if (rank%2 == 0) {                          ❶
29    MPI_Send(xsend, count, MPI_DOUBLE, partner_rank, tag, comm);
30    MPI_Recv(xrecv, count, MPI_DOUBLE, partner_rank, tag, comm,
         MPI_STATUS_IGNORE);
31 } else {                                    ❷
32    MPI_Recv(xrecv, count, MPI_DOUBLE, partner_rank, tag, comm,
         MPI_STATUS_IGNORE);
33    MPI_Send(xsend, count, MPI_DOUBLE, partner_rank, tag, comm);
34 }
❶ Even ranks post the send first.
❷ Odd ranks do the receive first.
But this is complicated to get right in more complex communication and requires careful use of conditionals. A better way to implement this is by using the MPI_Sendrecv call as the next listing shows. By using this call, you hand off the responsibility for correctly executing the communication to the MPI library. This is a pretty good deal for the programmer.
Listing 8.6 Send/receive with the MPI_Sendrecv call
Send_Recv/SendRecv4.c
28    MPI_Sendrecv(xsend, count, MPI_DOUBLE,   ❶
         partner_rank, tag,
29       xrecv, count, MPI_DOUBLE,
         partner_rank, tag, comm,
         MPI_STATUS_IGNORE);
❶ A combined send/receive call replaces the individual MPI_Send and MPI_Recv.
The MPI_Sendrecv call is a good example of the advantages of using the collective communication calls that we’ll present in section 8.3. It is good practice to use the collective communication calls when possible because these delegate responsibility for avoiding hangs and deadlocks, as well as the responsibility for good performance, to the MPI library.
As an alternative to the blocking communication calls in the previous examples, we look at using MPI_Isend and MPI_Irecv in listing 8.7. These are called immediate (I) versions because they return immediately. These are often referred to as asynchronous or non-blocking calls. Asynchronous means that the call initiates the operation but does not wait for the completion of the work.
Listing 8.7 A simple send/receive example using Isend and Irecv
Send_Recv/SendRecv5.c
27    MPI_Request requests[2] =
         {MPI_REQUEST_NULL, MPI_REQUEST_NULL}; ❶
28
29    MPI_Irecv(xrecv, count, MPI_DOUBLE,      ❷
         partner_rank, tag, comm,
         &requests[0]);
30    MPI_Isend(xsend, count, MPI_DOUBLE,      ❸
         partner_rank, tag, comm,
         &requests[1]);
31    MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);  ❹
❶ Defines an array of requests and sets to null so these are defined when tested for completion
❸ The Isend is then called after the Irecv is posted.
❹ Calls a Waitall to wait for the send and receive to complete
Each process waits at the MPI_Waitall on line 31 of the listing for message completion. You should also see a measurable improvement in program performance by reducing the number of places that block from every send and receive call to just the single MPI_Waitall. But you must be careful not to modify the send buffer or access the receive buffer until the operation completes. There are other combinations that work. Let’s look at the following listing, which uses one possibility.
Listing 8.8 A mixed immediate and blocking send/receive example
Send_Recv/SendRecv6.c
27    MPI_Request request;
28
29    MPI_Isend(xsend, count, MPI_DOUBLE,      ❶
         partner_rank, tag, comm,
         &request);
30    MPI_Recv(xrecv, count, MPI_DOUBLE,       ❷
         partner_rank, tag, comm,
         MPI_STATUS_IGNORE);
31    MPI_Request_free(&request);              ❸
❶ Posts the send with an MPI_Isend so that it returns immediately
❷ Calls the blocking receive. This process can continue as soon as it returns.
❸ Frees the request handle to avoid a memory leak
We start the communication with an asynchronous send and then block with a blocking receive. Once the blocking receive completes, this process can continue even if the send has not completed. You must still free the request handle, either with MPI_Request_free or as a side effect of a call to MPI_Wait or MPI_Test, to avoid a memory leak. You can also call MPI_Request_free immediately after the MPI_Isend.
Other variants of send/receive might be useful in special situations. The modes are indicated by a one- or two-letter prefix, similar to that seen in the immediate variant, as listed here:
The list of predefined MPI data types for C is extensive; the data types map to nearly all the types in the C language. MPI also has types corresponding to Fortran data types. We list just the most common ones for C:
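The list itself appears to have been lost from this copy. As a reference sketch drawn from the MPI standard (not reconstructed from the book's own table), the most commonly used C mappings are:

```c
/* Common predefined MPI datatypes for C (per the MPI standard):
   MPI_CHAR        char
   MPI_INT         int
   MPI_LONG        long
   MPI_LONG_LONG   long long
   MPI_UNSIGNED    unsigned int
   MPI_FLOAT       float
   MPI_DOUBLE      double
   MPI_BYTE        untyped bytes
   MPI_PACKED      data packed with MPI_Pack */
```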
MPI_PACKED and MPI_BYTE are special types that match any other type. MPI_BYTE indicates an untyped value, and the count specifies the number of bytes. It bypasses any data conversion operations in heterogeneous data communications. MPI_PACKED is used with the MPI_PACK routine, as the ghost exchange example in section 8.4.3 shows. You can also define your own data types to use in these calls; this is also demonstrated in the ghost exchange example. There are also many communication completion testing routines, which include
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
int MPI_Testany(int count, MPI_Request requests[], int *index, int *flag,
MPI_Status *status)
int MPI_Testall(int count, MPI_Request requests[], int *flag,
MPI_Status statuses[])
int MPI_Testsome(int incount, MPI_Request requests[], int *outcount,
int indices[], MPI_Status statuses[])
int MPI_Wait(MPI_Request *request, MPI_Status *status)
int MPI_Waitany(int count, MPI_Request requests[], int *index,
MPI_Status *status)
int MPI_Waitall(int count, MPI_Request requests[], MPI_Status statuses[])
int MPI_Waitsome(int incount, MPI_Request requests[], int *outcount,
int indices[], MPI_Status statuses[])
int MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status)
There are additional variants of MPI_Probe that are not listed here. MPI_Waitall is shown in several examples in this chapter. The other routines are useful in more specialized situations. The names of the routines give a good idea of the capabilities they provide.
In this section, we’ll look at the rich set of collective communication calls in MPI. Collective communications operate on a group of processes contained in an MPI communicator. To operate on a partial set of processes, you can create your own MPI communicator for a subset of MPI_COMM_WORLD such as every other process. Then you can use your communicator in place of MPI_COMM_WORLD in collective communication calls. Most of the collective communication routines operate on data. Figure 8.4 gives a visual idea of what each collective operation does.
Figure 8.4 The data movement of the most common MPI collective routines provides important functions for parallel programs. Additional variants MPI_Scatterv, MPI_Gatherv, and MPI_Allgatherv allow a variable amount of data to be sent or received from the processes. Not shown are some additional routines such as MPI_Alltoall and similar functions.
We’ll present examples of how to use the most commonly used collective operations as these might be applied in an application. The first example (in section 8.3.1) shows how you might use the barrier. It is the only collective routine that does not operate on data. Then we’ll show some examples with the broadcast (section 8.3.2), reduction (section 8.3.3), and finally, scatter/gather operations (sections 8.3.4 and 8.3.5). MPI also has a variety of all-to-all routines. But these are costly and rarely used, so we won’t cover those here. These collective operations all operate on a group of processes represented by a communication group. All members of a communication group must call the collective or your program will hang.
The simplest collective communication call is MPI_Barrier. It is used to synchronize all of the processes in an MPI communicator. In most programs, it should not be necessary, but it is often used for debugging and for synchronizing timers. Let’s look at how MPI_Barrier could be used to synchronize timers in the following listing. We also use the MPI_Wtime function to get the current time.
Listing 8.9 Using MPI_Barrier to synchronize a timer in an MPI program
SynchronizedTimer/SynchronizedTimer1.c
1 #include <mpi.h>
2 #include <unistd.h>
3 #include <stdio.h>
4 int main(int argc, char *argv[])
5 {
6 double start_time, main_time;
7
8 MPI_Init(&argc, &argv);
9 int rank;
10 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
11
12 MPI_Barrier(MPI_COMM_WORLD); ❶
13 start_time = MPI_Wtime(); ❷
14
15 sleep(30); ❸
16
17 MPI_Barrier(MPI_COMM_WORLD); ❹
18 main_time = MPI_Wtime() - start_time; ❺
19 if (rank == 0) printf("Time for work is %lf seconds\n", main_time);
20
21 MPI_Finalize();
22 return 0;
23 }
❶ Synchronizes all the processes so these start at about the same time
❷ Gets the starting value of the timer using the MPI_Wtime routine
❸ Simulates 30 seconds of work
❹ Synchronizes the processes to get the longest time taken
❺ Gets the timer value and subtracts the starting value to get the elapsed time
The barrier is inserted before starting the timer and then just before stopping the timer. This forces the timers on all of the processes to start at about the same time. By inserting the barrier before stopping the timer, we get the maximum time across all of the processes. Sometimes a synchronized timer gives a less confusing measure of time, but in other cases, an unsynchronized timer is better.
Note Synchronized timers and barriers should not be used in production runs; these can cause serious slowdowns in an application.
The broadcast sends data from one processor to all of the others. This operation is shown in figure 8.4 in the upper left. One of the uses of the broadcast, MPI_Bcast, is to send values read from an input file to all other processes. If every process tries to open a file at large process counts, it can take minutes to complete the file open. This is because file systems are inherently serial and one of the slower components of a computer system. For these reasons, for small file input, it is a good practice to only open and read a file from a single process. The following listing shows the way to do this.
Listing 8.10 Using MPI_Bcast to handle small file input
FileRead/FileRead.c
1 #include <stdio.h>
2 #include <string.h>
3 #include <stdlib.h>
4 #include <mpi.h>
5 int main(int argc, char *argv[])
6 {
7 int rank, input_size;
8 char *input_string, *line;
9 FILE *fin;
10
11 MPI_Init(&argc, &argv);
12 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
13
14 if (rank == 0){
15 fin = fopen("file.in", "r");
16 fseek(fin, 0, SEEK_END); ❶
17 input_size = ftell(fin); ❶
18 fseek(fin, 0, SEEK_SET); ❷
19 input_string = (char *)malloc((input_size+1)*sizeof(char));
20 fread(input_string, 1, input_size, fin); ❸
21 input_string[input_size] = '\0'; ❹
22 }
23
24 MPI_Bcast(&input_size, 1, MPI_INT, 0, ❺
MPI_COMM_WORLD); ❺
25 if (rank != 0) ❻
input_string = ❻
(char *)malloc((input_size+1)* ❻
sizeof(char)); ❻
26 MPI_Bcast(input_string, input_size, ❼
MPI_CHAR, 0, MPI_COMM_WORLD); ❼
27
28 if (rank == 0) fclose(fin);
29
30 line = strtok(input_string,"\n");
31 while (line != NULL){
32 printf("%d:input string is %s\n",rank,line);
33 line = strtok(NULL,"\n");
34 }
35 free(input_string);
36
37 MPI_Finalize();
38 return 0;
39 }
❶ Gets the file size to allocate an input buffer
❷ Resets the file pointer to the start of file
❸ Reads the entire file into the input buffer
❹ Null-terminates the input buffer
❺ Broadcasts size of input buffer
❻ Allocates input buffer on other processes
❼ Broadcasts the file contents to all processes
It is better to broadcast larger chunks of data than it is to broadcast many small individual values. We therefore broadcast the entire file. To do this, we need to first broadcast the size so that every process can allocate an input buffer and then broadcast the data. The file read and broadcasts are done from rank 0, generally referred to as the main process.
MPI_Bcast takes a pointer for the first argument, so when sending a scalar variable, we use the ampersand (&) operator to get the address of the variable. Then come the count and the type to fully define the data to be sent. The next argument specifies the originating process. It is 0 in both of these calls because that is the rank where the data resides. All other processes in the MPI_COMM_WORLD communicator then receive the data. This technique is for small input files. For larger file input or output, there are ways to conduct parallel file operations. The complex world of parallel input and output is discussed in chapter 16.
The reduction pattern, discussed in section 5.7, is one of the most important parallel computing patterns. The reduction operation is shown in figure 8.4 in the upper middle. An example of the reduction in Fortran array syntax is xsum = sum(x(:)), where the Fortran sum intrinsic sums the x array and puts it in the scalar variable xsum. The MPI reduction calls take an array or multi-dimensional array and combine the values into a scalar result. There are many operations that can be done during the reduction. The most common are
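The enumeration of operations appears to have been dropped here; as a sketch drawn from the MPI specification (rather than the book's original list), the most commonly used predefined reduction operators are:

```c
/* Common predefined MPI reduction operators (per the MPI standard):
   MPI_MAX     maximum            MPI_MIN     minimum
   MPI_SUM     sum                MPI_PROD    product
   MPI_LAND    logical AND        MPI_LOR     logical OR
   MPI_BAND    bitwise AND        MPI_BOR     bitwise OR
   MPI_MAXLOC  maximum and its location
   MPI_MINLOC  minimum and its location */
```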
The following listing shows how we can use MPI_Reduce to get the minimum, maximum, and average of a variable from every process.
Listing 8.11 Using reductions to get min, max, and avg timer results
SynchronizedTimer/SynchronizedTimer2.c
1 #include <mpi.h>
2 #include <unistd.h>
3 #include <stdio.h>
4 int main(int argc, char *argv[])
5 {
6 double start_time, main_time, min_time, max_time, avg_time;
7
8 MPI_Init(&argc, &argv);
9 int rank, nprocs;
10 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
11 MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
12
13 MPI_Barrier(MPI_COMM_WORLD); ❶
14 start_time = MPI_Wtime(); ❶
15
16 sleep(30); ❷
17
18 main_time = MPI_Wtime() - start_time; ❸
19 MPI_Reduce(&main_time, &max_time, 1, ❹
MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD); ❹
20 MPI_Reduce(&main_time, &min_time, 1, ❹
MPI_DOUBLE, MPI_MIN, 0,MPI_COMM_WORLD); ❹
21 MPI_Reduce(&main_time, &avg_time, 1, ❹
MPI_DOUBLE, MPI_SUM, 0,MPI_COMM_WORLD); ❹
22 if (rank == 0)
printf("Time for work is Min: %lf Max: %lf Avg: %lf seconds\n",
23 min_time, max_time, avg_time/nprocs);
24
25 MPI_Finalize();
26 return 0;
27 }
❶ Synchronizes all the processes so these start at about the same time
❷ Simulates 30 seconds of work
❸ Gets the timer value and subtracts the starting value to get the elapsed time
❹ Uses reduction calls to compute the max, min, and average time
The reduction result, the maximum in this case, is stored on rank 0 (argument 6 in the MPI_Reduce call), which in this case is the main process. If we wanted to just print it out on the main process, this would be appropriate. But if we wanted all of the processes to have the value, we would use the MPI_Allreduce routine.
You can also define your own operator. We’ll use the example of the Kahan enhanced-precision summation we have been working with and first introduced in section 5.7. The challenge in a distributed memory parallel environment is to carry the Kahan summation across process ranks. We start by looking at the main program in the following listing before looking at two other parts of the program in listings 8.13 and 8.14.
Listing 8.12 An MPI version of the Kahan summation
GlobalSums/globalsums.c
57 int main(int argc, char *argv[])
58 {
59 MPI_Init(&argc, &argv);
60 int rank, nprocs;
61 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
62 MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
63
64 init_kahan_sum(); ❶
65
66 if (rank == 0) printf("MPI Kahan tests\n");
67
68 for (int pow_of_two = 8; pow_of_two < 31; pow_of_two++){
69 long ncells = (long)pow((double)2,(double)pow_of_two);
70
71 int nsize;
72 double accurate_sum;
73 double *local_energy = ❷
init_energy(ncells, &nsize, ❷
&accurate_sum); ❷
74
75 struct timespec cpu_timer;
76 cpu_timer_start(&cpu_timer);
77
78 double test_sum = ❸
global_kahan_sum(nsize, local_energy); ❸
79
80 double cpu_time = cpu_timer_stop(cpu_timer);
81
82 if (rank == 0){
83 double sum_diff = test_sum-accurate_sum;
84 printf("ncells %ld log %d acc sum %-17.16lg sum %-17.16lg ",
85 ncells,(int)log2((double)ncells),accurate_sum,test_sum);
86 printf("diff %10.4lg relative diff %10.4lg runtime %lf\n",
87 sum_diff,sum_diff/accurate_sum, cpu_time);
88 }
89
90 free(local_energy);
91 }
92
93 MPI_Type_free(&EPSUM_TWO_DOUBLES); ❹
94 MPI_Op_free(&KAHAN_SUM); ❹
95 MPI_Finalize();
96 return 0;
97 }
❶ Initializes the new MPI data type and creates a new operator
❷ Gets a distributed array to work with
❸ Calculates the Kahan summation of the energy array across all processes
❹ Frees the custom data type and operator
The main program shows that the new MPI data type is created once at the start of the program and freed at the end, before MPI_Finalize. The call to perform the global Kahan summation is done multiple times within the loop, where the data size is increased by powers of two. Now let’s look at the next listing to see what needs to be done to initialize the new data type and operator.
Listing 8.13 Initializing new MPI data type and operator for Kahan summation
GlobalSums/globalsums.c
14 struct esum_type{ ❶
15 double sum; ❶
16 double correction; ❶
17 }; ❶
18
19 MPI_Datatype EPSUM_TWO_DOUBLES; ❷
20 MPI_Op KAHAN_SUM; ❸
21
22 void kahan_sum(struct esum_type * in,
struct esum_type * inout, int *len,
23 MPI_Datatype *EPSUM_TWO_DOUBLES) ❹
24 {
25 double corrected_next_term, new_sum;
26 corrected_next_term = in->sum + (in->correction + inout->correction);
27 new_sum = inout->sum + corrected_next_term;
28 inout->correction = corrected_next_term - (new_sum - inout->sum);
29 inout->sum = new_sum;
30 }
31
32 void init_kahan_sum(void){
33 MPI_Type_contiguous(2, MPI_DOUBLE, ❺
&EPSUM_TWO_DOUBLES); ❺
34 MPI_Type_commit(&EPSUM_TWO_DOUBLES); ❺
35
36 int commutative = 1; ❻
37 MPI_Op_create((MPI_User_function *)kahan_sum, ❻
commutative, &KAHAN_SUM); ❻
38 }
❶ Defines an esum_type structure to hold the sum and correction term
❷ Declares a new MPI data type composed of two doubles
❸ Declares a new Kahan summation operator
❹ Defines a function for the new operator using a predefined signature
❺ Creates the type and commits it
❻ Creates the new operator and commits it
We first create the new data type, EPSUM_TWO_DOUBLES, by combining two of the basic MPI_DOUBLE data types in line 33. We have to declare the type outside the routine at line 19 so that it is available for use by the summation routine. To create the new operator, we first write the function to use as the operator in lines 22-30. We then use esum_type to pass both double values in and back out. We also need to pass in the length and the data type that it will operate on as the new EPSUM_TWO_DOUBLES type.
In the process of creating a Kahan sum reduction operator, we showed you how to create a new MPI data type and a new MPI reduction operator. Now let’s move on to actually calculating the global sum of the array across MPI ranks as the following listing shows.
Listing 8.14 Performing an MPI Kahan summation
GlobalSums/globalsums.c
40 double global_kahan_sum(int nsize, double *local_energy){
41 struct esum_type local, global;
42 local.sum = 0.0; ❶
43 local.correction = 0.0; ❶
44
45 for (long i = 0; i < nsize; i++) { ❷
46 double corrected_next_term = ❷
local_energy[i] + local.correction; ❷
47 double new_sum = ❷
 local.sum + corrected_next_term; ❷
48 local.correction = corrected_next_term - ❷
(new_sum - local.sum); ❷
49 local.sum = new_sum; ❷
50 } ❷
51
52 MPI_Allreduce(&local, &global, 1, EPSUM_TWO_DOUBLES, KAHAN_SUM, ❸
MPI_COMM_WORLD);
53
54 return global.sum;
55 }
❶ Initializes both members of the esum_type to zero
❷ Performs the on-process Kahan summation
❸ Performs the reduction with the new KAHAN_SUM operator
Calculating the global Kahan summation is relatively easy now. We can do the local Kahan sum as shown in section 5.7, but we have to add MPI_Allreduce at line 52 to get the global result. Here, we use the allreduce operation so that the result ends up on all processors, as shown in figure 8.4 in the upper right.
A gather operation can be described as a collate operation, where data from all processors is brought together and stacked into a single array, as shown in figure 8.4 in the lower center. You can use this collective communication call to bring order to the console output from your program. By now, you should have noticed that the output printed from multiple ranks of an MPI program comes out in random order, producing a jumbled, confusing mess. Let’s look at a better way to handle this so that the only output is from the main process. By printing the output from only the main process, the order will be correct. The next listing shows a sample program that gets data from all of the processes and prints it in a nice, orderly output.
Listing 8.15 Using a gather to print debug messages
DebugPrintout/DebugPrintout.c
1 #include <stdio.h>
2 #include <time.h>
3 #include <unistd.h>
4 #include <mpi.h>
5 #include "timer.h"
6 int main(int argc, char *argv[])
7 {
8 int rank, nprocs;
9 double total_time = 0.0;
10 struct timespec tstart_time;
11
12 MPI_Init(&argc, &argv);
13 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
14 MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
15
16 cpu_timer_start(&tstart_time);
17 sleep(30); ❶
18 total_time += cpu_timer_stop(tstart_time);
19
20 double times[nprocs]; ❷
21 MPI_Gather(&total_time, 1, MPI_DOUBLE, ❸
times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD); ❸
22 if (rank == 0) { ❹
23 for (int i=0; i<nprocs; i++){ ❺
24 printf("%d:Work took %lf secs\n", ❻
i, times[i]); ❻
25 }
26 }
27
28 MPI_Finalize();
29 return 0;
30 }
❶ Gets unique values on each process for our example
❷ Needs an array to collect all the times
❸ Uses a gather to bring all the values to process zero
❹ Only prints on the main process
❺ Loops over the processes for the print
❻ Prints the time for each process
MPI_Gather takes the standard triplet describing the data source. We need to use the ampersand to get the address of the scalar variable total_time. The destination is also a triplet, with times as the destination array. An array is already an address, so no ampersand is needed. The gather is done to process 0 of the MPI world communication group. From there, it requires a loop to print the time for each process. We prepend every line with a number in the format #: so that it is clear which process the output refers to.
The scatter operation, shown in figure 8.4 in the lower left, is the opposite of the gather operation. For this operation, the data is sent from one process to all the others in the communication group. The most common use for a scatter operation is in a parallel strategy that distributes data arrays out to other processes for work. This is provided by the MPI_Scatter and MPI_Scatterv routines. The following listing shows the implementation.
Listing 8.16 Using scatter to distribute data and gather to bring it back
ScatterGather/ScatterGather.c
1 #include <stdio.h>
2 #include <stdlib.h>
3 #include <mpi.h>
4 int main(int argc, char *argv[])
5 {
6 int rank, nprocs, ncells = 100000;
7
8 MPI_Init(&argc, &argv);
9 MPI_Comm comm = MPI_COMM_WORLD;
10 MPI_Comm_rank(comm, &rank);
11 MPI_Comm_size(comm, &nprocs);
12
13 long ibegin = ncells *(rank )/nprocs; ❶
14 long iend = ncells *(rank+1)/nprocs; ❶
15 int nsize = (int)(iend-ibegin); ❶
16
17 double *a_global, *a_test;
18 if (rank == 0) {
19 a_global = (double *) ❷
malloc(ncells*sizeof(double)); ❷
20 for (int i=0; i<ncells; i++) { ❷
21 a_global[i] = (double)i; ❷
22 } ❷
23 }
24
25 int nsizes[nprocs], offsets[nprocs]; ❸
26 MPI_Allgather(&nsize, 1, MPI_INT, nsizes, ❸
1, MPI_INT, comm); ❸
27 offsets[0] = 0; ❸
28 for (int i = 1; i<nprocs; i++){ ❸
29 offsets[i] = offsets[i-1] + nsizes[i-1]; ❸
30 } ❸
31
32 double *a = (double *) ❹
malloc(nsize*sizeof(double)); ❹
33 MPI_Scatterv(a_global, nsizes, offsets, ❹
34 MPI_DOUBLE, a, nsize, MPI_DOUBLE, 0, comm); ❹
35
36 for (int i=0; i<nsize; i++){ ❺
37 a[i] += 1.0; ❺
38 } ❺
39
40 if (rank == 0) {
41 a_test = (double *) ❻
malloc(ncells*sizeof(double)); ❻
42 }
43
44 MPI_Gatherv(a, nsize, MPI_DOUBLE, ❻
45 a_test, nsizes, offsets, ❻
MPI_DOUBLE, 0, comm); ❻
46
47 if (rank == 0){
48 int ierror = 0;
49 for (int i=0; i<ncells; i++){
50 if (a_test[i] != a_global[i] + 1.0) {
51 printf("Error: index %d a_test %lf a_global %lf\n",
52 i,a_test[i],a_global[i]);
53 ierror++;
54 }
55 }
56 printf("Report: Correct results %d errors %d\n",
ncells-ierror,ierror);
57 }
58
59 free(a);
60 if (rank == 0) {
61 free(a_global);
62 free(a_test);
63 }
64
65 MPI_Finalize();
66 return 0;
67 }
❶ Computes the size of the array on every process
❷ Sets up data on the main process
❸ Gets the sizes and offsets into global arrays for communication
❹ Distributes the data onto the other processes
❻ Returns array data to the main process, perhaps for output
We first need to calculate the size of the data on each process. The distribution should be as equal as possible. A simple way to calculate the size is shown in lines 13-15, using integer arithmetic. We also need the global array, but only on the main process, so we allocate and initialize it there in lines 18-23. To distribute or gather the data, the sizes and offsets for all processes must be known; lines 25-30 show the typical calculation. The actual scatter is done with MPI_Scatterv on lines 32-34. The data source is described with the arguments buffer, counts, offsets, and data type. The destination is handled with the standard triplet. Then the source rank that sends the data is specified as rank 0. Finally, the last argument is comm, the communicator over which the data is distributed.
MPI_Gatherv does the opposite operation, as shown in figure 8.4. We only need the global array on the main process, and so it is only allocated there on lines 40-42. The arguments to MPI_Gatherv start with the description of the source with the standard triplet. Then the destination is described with the same four arguments as were used in the scatter. The destination rank is the next argument, followed by the communication group.
It should be noted that the sizes and offsets used in the MPI_Gatherv call are all of integer type, which limits the size of the data that can be handled. For version 3 of the MPI standard, there was an attempt to change these arguments to a long type so that larger data sizes could be handled. It was not approved because it would break too many applications. Stay tuned for new calls that support long integer counts in one of the upcoming MPI standards.
The data parallel strategy, defined in section 1.5, is the most common approach in parallel applications. We’ll look at a few examples of this approach in this section. First, we’ll look at a simple case of the stream triad where no communication is necessary. Then we’ll look at the more typical ghost cell exchange techniques used to link together the subdivided domains distributed to each process.
The STREAM Triad is a bandwidth testing benchmark code introduced in section 3.2.4. This version uses MPI to get more processes working on the node and, possibly, on multiple nodes. The purpose of having more processes is to see what the maximum bandwidth is for the node when all processors are used. This gives a target bandwidth to aim for with more complicated applications. As listing 8.17 shows, the code is simple because no communication between ranks is required. The timing is only reported on the main process. You can run this first on one processor and then on all the processors on your node. Do you get the full parallel speedup that you would expect from the increase in processors? How much does the system memory bandwidth limit your speedup?
Listing 8.17 The MPI version of the STREAM Triad
StreamTriad/StreamTriad.c
 1 #include <stdio.h>
 2 #include <stdlib.h>
 3 #include <time.h>
 4 #include <mpi.h>
 5 #include "timer.h"
 6
 7 #define NTIMES 16
 8 #define STREAM_ARRAY_SIZE 80000000                ❶
 9
10 int main(int argc, char *argv[]){
11
12    MPI_Init(&argc, &argv);
13
14    int nprocs, rank;
15    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
16    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
17    int ibegin = STREAM_ARRAY_SIZE *(rank  )/nprocs;
18    int iend   = STREAM_ARRAY_SIZE *(rank+1)/nprocs;
19    int nsize = iend-ibegin;
20    double *a = malloc(nsize * sizeof(double));
21    double *b = malloc(nsize * sizeof(double));
22    double *c = malloc(nsize * sizeof(double));
23
24    struct timespec tstart;
25    double scalar = 3.0, time_sum = 0.0;           ❷
26    for (int i=0; i<nsize; i++) {                  ❷
27       a[i] = 1.0;                                 ❷
28       b[i] = 2.0;                                 ❷
29    }                                              ❷
30
31    for (int k=0; k<NTIMES; k++){
32       cpu_timer_start(&tstart);
33       for (int i=0; i<nsize; i++){                ❸
34          c[i] = a[i] + scalar*b[i];               ❸
35       }                                           ❸
36       time_sum += cpu_timer_stop(tstart);
37       c[1]=c[2];                                  ❹
38    }
39
40    free(a);
41    free(b);
42    free(c);
43
44    if (rank == 0) printf("Average runtime is %lf msecs\n", time_sum/NTIMES);
45    MPI_Finalize();
46    return(0);
47 }
❶ Large enough to force into main memory
❷ Initializes the arrays
❸ The stream triad computational loop
❹ Keeps the compiler from optimizing out the loop
Ghost cells are the mechanism that we use to link the meshes on adjacent processors. These are used to cache values from adjacent processors so that fewer communications are needed. The ghost cell technique is the single most important method for enabling distributed memory parallelism in MPI.
Let’s talk a little bit about the terminology of halos and ghost cells. Even before the age of parallel processing, a region of cells surrounding the mesh was often used to implement boundary conditions. These boundary conditions could be reflective, inflow, outflow, or periodic. For efficiency, programmers wanted to avoid if statements in the main computational loop. To do this, they added cells surrounding the mesh and set those to the appropriate values before the main computational loop. These cells had the appearance of a halo, so the name stuck. Halo cells are any set of cells surrounding a computational mesh, regardless of their purpose. A domain-boundary halo, then, consists of halo cells used to impose a specific set of boundary conditions.
Once applications were parallelized, a similar outer region of cells was added to hold values from the neighboring meshes. These cells are not real cells; they exist only as an aid to reduce communication costs. Because they are not real, they were soon given the name ghost cells. The real data for a ghost cell lives on the adjacent processor, and the local copy is just a ghost value. Ghost cells also look like halos and are likewise referred to as halo cells. Ghost cell updates or exchanges refresh the ghost cells with real values from adjacent processes and are only needed in parallel, multi-process runs.
The boundary conditions need to be done for both serial and parallel runs. Confusion exists because these operations are often referred to as halo updates, although it’s unclear exactly what is meant. In our terminology, halo updates refers to both the domain boundary updates and ghost cell updates. For optimizing MPI communication, we only need to look at the ghost cell updates or exchanges and put aside the boundary conditions calculations for the present.
Let’s now look at how to set up ghost cells for the borders of the local mesh on each process and perform the communication between the subdomains. By using ghost cells, the needed communications are grouped into far fewer messages than if a separate communication were done every time a cell’s value is needed from another process. This is the most common technique for making the data parallel approach perform well. In the implementations of the ghost cell updates, we’ll demonstrate the use of the MPI_Pack routine and, alternatively, loading a communication buffer with a simple cell-by-cell array assignment. In later sections, we’ll also see how to do the same communication with MPI data types, using the MPI topology calls for setup and communication.
Once we implement the ghost cell updates in a data parallel code, most of the needed communication is handled. This isolates the code that provides the parallelism into a small section of the application, and that small section is the important one to optimize for parallel efficiency. Let’s look at some implementations of this functionality, starting with the setup in listing 8.18 and the work done by the stencil loops in listing 8.19. You may want to look at the full code in the GhostExchange/GhostExchange_Pack directory of the example code for the chapter at https://github.com/EssentialsOfParallelComputing/Chapter8.
Listing 8.18 Setup for ghost cell exchanges in a 2D mesh
GhostExchange/GhostExchange_Pack/GhostExchange.cc
30 int imax = 2000, jmax = 2000;    ❶
31 int nprocx = 0, nprocy = 0;      ❷
32 int nhalo = 2, corners = 0;      ❸
33 int do_timing;                   ❹
....
40 int xcoord = rank%nprocx;        ❺
41 int ycoord = rank/nprocx;        ❺
42
43 int nleft = (xcoord > 0       ) ? rank - 1      : MPI_PROC_NULL;   ❻
44 int nrght = (xcoord < nprocx-1) ? rank + 1      : MPI_PROC_NULL;   ❻
45 int nbot  = (ycoord > 0       ) ? rank - nprocx : MPI_PROC_NULL;   ❻
46 int ntop  = (ycoord < nprocy-1) ? rank + nprocx : MPI_PROC_NULL;   ❻
47
48 int ibegin = imax *(xcoord  )/nprocx;   ❼
49 int iend   = imax *(xcoord+1)/nprocx;   ❼
50 int isize  = iend - ibegin;             ❼
51 int jbegin = jmax *(ycoord  )/nprocy;   ❼
52 int jend   = jmax *(ycoord+1)/nprocy;   ❼
53 int jsize  = jend - jbegin;             ❼
❶ Input settings: -i <imax> -j <jmax> are the sizes of the grid.
❷ -x <nprocx> -y <nprocy> are the number of processes in x- and y-directions.
❸ -h <nhalo> -c is the number of halo cells and -c includes corner cells.
❹ -t do_timing synchronizes timing.
❺ xcoord and ycoord of processes. Row index varies fastest.
❻ Neighbor rank for each process for neighbor communication
❼ Size of computational domain for each process and the global begin and end index
We do memory allocation for the local size plus room for the halos on each process. To make the indexing a little simpler, we offset the memory indexing to start at -nhalo and end at isize+nhalo. The real cells then are always from 0 to isize-1, regardless of the width of the halo.
The following lines show a call to a special malloc2D with two additional arguments that offset the array addressing so that the real part of the array is from 0,0 to jsize,isize. This is done with some pointer arithmetic that moves the starting location of each pointer.
64 double** x    = malloc2D(jsize+2*nhalo, isize+2*nhalo, nhalo, nhalo);
65 double** xnew = malloc2D(jsize+2*nhalo, isize+2*nhalo, nhalo, nhalo);
We use the simple stencil calculation from the blur operator introduced in figure 1.10 to provide the work. Many applications have far more complex computations that take much more time. The following listing shows the stencil calculation loops.
Listing 8.19 Work is done in a stencil iteration loop
GhostExchange/GhostExchange_Pack/GhostExchange.cc
91 for (int iter = 0; iter < 1000; iter++){ ❶
92 cpu_timer_start(&tstart_stencil);
93
94 for (int j = 0; j < jsize; j++){ ❷
95 for (int i = 0; i < isize; i++){ ❷
96 xnew[j][i]= ❷
(x[j][i] + x[j][i-1] + x[j][i+1] + ❷
x[j-1][i] + x[j+1][i])/5.0; ❷
97 } ❷
98 } ❷
99
100 SWAP_PTR(xnew, x, xtmp); ❸
101
102 stencil_time += cpu_timer_stop(tstart_stencil);
103
104 boundarycondition_update(x, nhalo, jsize,
isize, nleft, nrght, nbot, ntop);
105 ghostcell_update(x, nhalo, corners, ❹
jsize, isize, nleft, nrght, nbot, ntop); ❹
106 } ❶
❶ Main iteration loop
❷ Stencil computation from the blur operator
❸ Pointer swap for old and new x arrays
❹ Ghost cell update call refreshes ghost cells.
Now we can look at the critical ghost cell update code. Figure 8.5 shows the required operation. The ghost cell region can be one, two, or more cells deep, and some applications also need the corner cells. Four processes (or ranks) each need data from this rank: the ones to the left, right, top, and bottom. Each of these exchanges requires a separate communication and a separate data buffer. The width of the halo region varies between applications, as does whether the corner cells are needed.
Figure 8.5 shows an example of a ghost cell exchange for a 4-by-4 mesh on nine processes with a one-cell-wide halo and the corners included. The outer boundary halos are updated first, followed by a horizontal data exchange, a synchronization, and then the vertical data exchange. If the corners are not needed, the horizontal and vertical exchanges can be done at the same time; if they are, a synchronization is necessary between the two exchanges.
Figure 8.5 The corner cell version of the ghost cell update first exchanges data to the left and right (on the top half of the figure), followed by a top and bottom exchange (on the bottom half of the figure). With care, the left and right exchange can be smaller with just the real cells plus the outer boundary cells, although there is no harm in making it the full vertical size of the mesh. The updating of the boundary cells surrounding the mesh is done separately.
A key observation about the ghost cell data updates is that in C, the row data is contiguous, whereas the column data is separated by a stride equal to the row length. Sending individual values for the columns is expensive, so we need to group them together somehow.
You can perform the ghost cell update with MPI in several ways. In this first version in listing 8.20, we’ll look at an implementation using the MPI_Pack call to pack the column data. The row data is sent with just a standard MPI_Isend call. The width of the ghost cell region is specified by the nhalo variable, and corners can be requested with the proper input.
Listing 8.20 Ghost cell update routine for 2D mesh with MPI_Pack
GhostExchange/GhostExchange_Pack/GhostExchange.cc
167 void ghostcell_update(double **x, int nhalo, int corners, int jsize, int isize,  ❶
168         int nleft, int nrght, int nbot, int ntop, int do_timing)                 ❶
169 {
170    if (do_timing) MPI_Barrier(MPI_COMM_WORLD);
171
172    struct timespec tstart_ghostcell;
173    cpu_timer_start(&tstart_ghostcell);
174
175    MPI_Request request[4*nhalo];
176    MPI_Status status[4*nhalo];
177
178    int jlow=0, jhgh=jsize;
179    if (corners) {
180       if (nbot == MPI_PROC_NULL) jlow = -nhalo;
181       if (ntop == MPI_PROC_NULL) jhgh = jsize+nhalo;
182    }
183    int jnum = jhgh-jlow;
184    int bufcount = jnum*nhalo;
185    int bufsize = bufcount*sizeof(double);
186
187    double xbuf_left_send[bufcount];
188    double xbuf_rght_send[bufcount];
189    double xbuf_rght_recv[bufcount];
190    double xbuf_left_recv[bufcount];
191
192    int position_left;                                                  ❷
193    int position_right;                                                 ❷
194    if (nleft != MPI_PROC_NULL){                                        ❷
195       position_left = 0;                                               ❷
196       for (int j = jlow; j < jhgh; j++){                               ❷
197          MPI_Pack(&x[j][0], nhalo, MPI_DOUBLE,                         ❷
198                   xbuf_left_send, bufsize, &position_left,             ❷
                      MPI_COMM_WORLD);                                     ❷
199       }                                                                ❷
200    }                                                                   ❷
201
202    if (nrght != MPI_PROC_NULL){                                        ❷
203       position_right = 0;                                              ❷
204       for (int j = jlow; j < jhgh; j++){                               ❷
205          MPI_Pack(&x[j][isize-nhalo], nhalo, MPI_DOUBLE, xbuf_rght_send, ❷
206                   bufsize, &position_right, MPI_COMM_WORLD);           ❷
207       }                                                                ❷
208    }                                                                   ❷
209
210    MPI_Irecv(&xbuf_rght_recv, bufsize, MPI_PACKED, nrght, 1001,        ❸
211              MPI_COMM_WORLD, &request[0]);                             ❸
212    MPI_Isend(&xbuf_left_send, bufsize, MPI_PACKED, nleft, 1001,        ❸
213              MPI_COMM_WORLD, &request[1]);                             ❸
214
215    MPI_Irecv(&xbuf_left_recv, bufsize, MPI_PACKED, nleft, 1002,        ❸
216              MPI_COMM_WORLD, &request[2]);                             ❸
217    MPI_Isend(&xbuf_rght_send, bufsize, MPI_PACKED, nrght, 1002,        ❸
218              MPI_COMM_WORLD, &request[3]);                             ❸
219    MPI_Waitall(4, request, status);                                    ❸
220
221    if (nrght != MPI_PROC_NULL){                                        ❹
222       position_right = 0;                                              ❹
223       for (int j = jlow; j < jhgh; j++){                               ❹
224          MPI_Unpack(xbuf_rght_recv, bufsize, &position_right, &x[j][isize], ❹
225                     nhalo, MPI_DOUBLE, MPI_COMM_WORLD);                ❹
226       }                                                                ❹
227    }                                                                   ❹
228
229    if (nleft != MPI_PROC_NULL){                                        ❹
230       position_left = 0;                                               ❹
231       for (int j = jlow; j < jhgh; j++){                               ❹
232          MPI_Unpack(xbuf_left_recv, bufsize, &position_left, &x[j][-nhalo], ❹
233                     nhalo, MPI_DOUBLE, MPI_COMM_WORLD);                ❹
234       }                                                                ❹
235    }                                                                   ❹
236
237    if (corners) {
238       bufcount = nhalo*(isize+2*nhalo);
239       MPI_Irecv(&x[jsize][-nhalo], bufcount, MPI_DOUBLE, ntop, 1001,   ❺
240                 MPI_COMM_WORLD, &request[0]);                          ❺
241       MPI_Isend(&x[0    ][-nhalo], bufcount, MPI_DOUBLE, nbot, 1001,   ❺
242                 MPI_COMM_WORLD, &request[1]);                          ❺
243
244       MPI_Irecv(&x[ -nhalo][-nhalo], bufcount, MPI_DOUBLE, nbot, 1002, ❺
245                 MPI_COMM_WORLD, &request[2]);                          ❺
246       MPI_Isend(&x[jsize-nhalo][-nhalo], bufcount, MPI_DOUBLE, ntop, 1002, ❺
247                 MPI_COMM_WORLD, &request[3]);                          ❺
248       MPI_Waitall(4, request, status);                                 ❻
249    } else {
250       for (int j = 0; j<nhalo; j++){                                   ❼
251          MPI_Irecv(&x[jsize+j][0], isize, MPI_DOUBLE, ntop, 1001+j*2,  ❼
252                    MPI_COMM_WORLD, &request[0+j*4]);                   ❼
253          MPI_Isend(&x[0+j    ][0], isize, MPI_DOUBLE, nbot, 1001+j*2,  ❼
254                    MPI_COMM_WORLD, &request[1+j*4]);                   ❼
255
256          MPI_Irecv(&x[ -nhalo+j][0], isize, MPI_DOUBLE, nbot, 1002+j*2, ❼
257                    MPI_COMM_WORLD, &request[2+j*4]);                    ❼
258          MPI_Isend(&x[jsize-nhalo+j][0], isize, MPI_DOUBLE, ntop, 1002+j*2, ❼
259                    MPI_COMM_WORLD, &request[3+j*4]);                    ❼
260       }                                                                 ❼
261       MPI_Waitall(4*nhalo, request, status);                            ❽
262    }
263
264    if (do_timing) MPI_Barrier(MPI_COMM_WORLD);
265
266    ghostcell_time += cpu_timer_stop(tstart_ghostcell);
267 }
❶ The update of the ghost cells from adjacent processes
❷ Packs buffers for ghost cell update for left and right neighbors
❸ Communication for left and right neighbors
❹ Unpacks buffers for left and right neighbors
❺ Ghost cell updates in one contiguous block for bottom and top neighbors
❻ Waits for all communication to complete
❼ Ghost cell updates one row at a time for bottom and top neighbors
❽ Waits for all communication to complete
The MPI_Pack call is particularly useful when there are multiple data types that need to be communicated in the ghost update. The values are packed into a type-agnostic buffer and then unpacked on the other side. The neighbor communication in the vertical direction is done with contiguous row data. When there are corners included, a single buffer works well. Without corners, individual halo rows are sent. There are usually only one or two halo cells, so this is a reasonable approach.
Another way to load the buffers for the communication is with an array assignment. Array assignments are a good approach when there is a single, simple data type like the double-precision float type used in this example. The following listing shows the code for replacing the MPI_Pack loops with array assignments.
Listing 8.21 Ghost cell update routine for 2D mesh with array assignments
GhostExchange/GhostExchange_ArrayAssign/GhostExchange.cc
190 int icount;
191 if (nleft != MPI_PROC_NULL){ ❶
192 icount = 0; ❶
193 for (int j = jlow; j < jhgh; j++){ ❶
194 for (int ll = 0; ll < nhalo; ll++){ ❶
195 xbuf_left_send[icount++] = x[j][ll]; ❶
196 } ❶
197 } ❶
198 } ❶
199 if (nrght != MPI_PROC_NULL){ ❶
200 icount = 0; ❶
201 for (int j = jlow; j < jhgh; j++){ ❶
202 for (int ll = 0; ll < nhalo; ll++){ ❶
203 xbuf_rght_send[icount++] = ❶
x[j][isize-nhalo+ll]; ❶
204 } ❶
205 } ❶
206 } ❶
207
208 MPI_Irecv(&xbuf_rght_recv, bufcount, ❷
MPI_DOUBLE, nrght, 1001, ❷
209 MPI_COMM_WORLD, &request[0]); ❷
210 MPI_Isend(&xbuf_left_send, bufcount, ❷
MPI_DOUBLE, nleft, 1001, ❷
211 MPI_COMM_WORLD, &request[1]); ❷
212
213 MPI_Irecv(&xbuf_left_recv, bufcount, ❷
MPI_DOUBLE, nleft, 1002, ❷
214 MPI_COMM_WORLD, &request[2]); ❷
215 MPI_Isend(&xbuf_rght_send, bufcount, ❷
MPI_DOUBLE, nrght, 1002, ❷
216 MPI_COMM_WORLD, &request[3]); ❷
217 MPI_Waitall(4, request, status); ❷
218
219 if (nrght != MPI_PROC_NULL){ ❸
220 icount = 0; ❸
221 for (int j = jlow; j < jhgh; j++){ ❸
222 for (int ll = 0; ll < nhalo; ll++){ ❸
223 x[j][isize+ll] = ❸
xbuf_rght_recv[icount++]; ❸
224 } ❸
225 } ❸
226 } ❸
227 if (nleft != MPI_PROC_NULL){ ❸
228 icount = 0; ❸
229 for (int j = jlow; j < jhgh; j++){ ❸
230 for (int ll = 0; ll < nhalo; ll++){ ❸
231 x[j][-nhalo+ll] = ❸
xbuf_left_recv[icount++]; ❸
232 } ❸
233 } ❸
234 } ❸
❶ Copies data into the send buffers for the left and right neighbors
❷ Performs the communication between left and right neighbors
❸ Copies the receive buffers into the ghost cells
The MPI_Irecv and MPI_Isend calls now use a count and the MPI_DOUBLE data type rather than the generic byte type of MPI_Pack. We also need to know the data type for copying data into and out of the communication buffer.
You can also do a ghost cell exchange for a 3D stencil calculation; we’ll do that in listing 8.22. The setup is a little more complicated, however. The process layout is first calculated as xcoord, ycoord, and zcoord values. Then the neighbors are determined, and the sizes of the data on each processor are calculated.
Listing 8.22 Setup for a 3D mesh
GhostExchange/GhostExchange3D_*/GhostExchange.cc
63 int xcoord = rank%nprocx; ❶
64 int ycoord = rank/nprocx%nprocy; ❶
65 int zcoord = rank/(nprocx*nprocy); ❶
66
67 int nleft = (xcoord > 0 ) ? rank - 1 : MPI_PROC_NULL; ❷
68 int nrght = (xcoord < nprocx-1) ? rank + 1 : MPI_PROC_NULL; ❷
69 int nbot  = (ycoord > 0 ) ? rank - nprocx : MPI_PROC_NULL; ❷
70 int ntop  = (ycoord < nprocy-1) ? rank + nprocx : MPI_PROC_NULL; ❷
71 int nfrnt = (zcoord > 0 ) ? rank - nprocx * nprocy : MPI_PROC_NULL; ❷
72 int nback = (zcoord < nprocz-1) ? rank + nprocx * nprocy : MPI_PROC_NULL; ❷
73
74 int ibegin = imax *(xcoord  )/nprocx; ❸
75 int iend   = imax *(xcoord+1)/nprocx; ❸
76 int isize  = iend - ibegin; ❸
77 int jbegin = jmax *(ycoord  )/nprocy; ❸
78 int jend   = jmax *(ycoord+1)/nprocy; ❸
79 int jsize  = jend - jbegin; ❸
80 int kbegin = kmax *(zcoord  )/nprocz; ❸
81 int kend   = kmax *(zcoord+1)/nprocz; ❸
82 int ksize  = kend - kbegin; ❸
❶ Sets up the process coordinates
❷ Calculates the neighbor processes for each process
❸ Calculates the beginning and ending index for each process and then the size
The ghost cell update, including the array copies into buffers, the communication, and the copies back out, is a couple of hundred lines long and can’t be shown here. Refer to the code examples (https://github.com/EssentialsofParallelComputing/Chapter8) that accompany the chapter for the detailed implementation. We’ll show an MPI data type version of the ghost cell update in section 8.5.1.
The excellent design of MPI becomes apparent as we see how basic MPI components can be combined into higher-level functionality. We got a taste of this in section 8.3.3, when we created a new double-double type and a new reduction operator. This extensibility gives MPI important capabilities. We’ll look at a couple of these advanced functions that are useful in common data parallel applications. These include
MPI custom data types—Builds new data types from the basic MPI type building blocks.
Topology support—A basic Cartesian regular grid topology and a more general graph topology are both available. We’ll just look at the simpler MPI Cartesian functions.
MPI has a rich set of functions to create new, custom MPI data types from the basic MPI types. This allows the encapsulation of complex data into a single custom data type that you can use in communication calls. As a result, a single communication call can send or receive many smaller pieces of data as a unit. Here is a list of some of the MPI data type creation functions:
MPI_Type_contiguous—Makes a block of contiguous data into a type.
MPI_Type_vector—Creates a type out of blocks of strided data.
MPI_Type_create_subarray—Creates a rectangular subset of a larger array.
MPI_Type_indexed or MPI_Type_create_hindexed—Creates an irregular set of indices described by a set of block lengths and displacements. The hindexed version expresses the displacements in bytes instead of a data type for more generality.
MPI_Type_create_struct—Creates a data type encapsulating the data items in a structure in a portable way that accounts for padding by the compiler.
You’ll find a visual illustration helpful in understanding some of these data types. Figure 8.6 shows some of the simpler and more commonly used functions, including MPI_Type_contiguous, MPI_Type_vector, and MPI_Type_create_subarray.
Figure 8.6 Three MPI custom data types with illustrations of the arguments used in their creation
Once a data type is described and made into a new data type, it must be initialized before it is used. For this purpose, there are a couple of additional routines to commit and free the types. A type must be committed before use and it must be freed to avoid a memory leak. The routines include
MPI_Type_commit—Initializes the new custom type with needed memory allocation or other setup
MPI_Type_free—Frees any memory or data structure entries from the creation of the data type
We can greatly simplify the ghost cell communication by defining a custom MPI data type as was shown in figure 8.6 to represent the column of data and to avoid the MPI_Pack calls. By defining an MPI data type, an extra data copy can be avoided. The data can be copied from its regular location straight into the MPI send buffers. Let’s see how this is done in listing 8.23. Listing 8.24 shows the second part of the program.
We first set up the custom data types. We use the MPI_Type_vector call for sets of strided array accesses. For the vertical type, the data is contiguous when corners are included, so we use the MPI_Type_contiguous call instead. In lines 139 and 140, we free the data types at the end, before MPI_Finalize.
Listing 8.23 Creating a 2D vector data type for the ghost cell update
GhostExchange/GhostExchange_VectorTypes/GhostExchange.cc
56 int jlow=0, jhgh=jsize;
57 if (corners) {
58 if (nbot == MPI_PROC_NULL) jlow = -nhalo;
59 if (ntop == MPI_PROC_NULL) jhgh = jsize+nhalo;
60 }
61 int jnum = jhgh-jlow;
62
63 MPI_Datatype horiz_type;
64 MPI_Type_vector(jnum, nhalo, isize+2*nhalo,
MPI_DOUBLE, &horiz_type);
65 MPI_Type_commit(&horiz_type);
66
67 MPI_Datatype vert_type;
68 if (! corners){
69 MPI_Type_vector(nhalo, isize, isize+2*nhalo,
MPI_DOUBLE, &vert_type);
70 } else {
71 MPI_Type_contiguous(nhalo*(isize+2*nhalo),
MPI_DOUBLE, &vert_type);
72 }
73 MPI_Type_commit(&vert_type);
...
139 MPI_Type_free(&horiz_type);
140 MPI_Type_free(&vert_type);
You can then write the ghostcell_update more concisely and with better performance using the MPI data types as in the following listing. If we need to update corners, a synchronization is needed between the two communication passes.
Listing 8.24 2D ghost cell update routine using the vector data type
GhostExchange/GhostExchange_VectorTypes/GhostExchange.cc
197 int jlow=0, jhgh=jsize, ilow=0, waitcount=8, ib=4;
198 if (corners) {
199 if (nbot == MPI_PROC_NULL) jlow = -nhalo;
200 ilow = -nhalo;
201 waitcount = 4;
202 ib = 0;
203 }
204
205 MPI_Request request[waitcount];
206 MPI_Status status[waitcount];
207
208 MPI_Irecv(&x[jlow][isize], 1, ❶
horiz_type, nrght, 1001, ❶
209 MPI_COMM_WORLD, &request[0]); ❶
210 MPI_Isend(&x[jlow][0], 1, ❶
horiz_type, nleft, 1001, ❶
211 MPI_COMM_WORLD, &request[1]); ❶
212
213 MPI_Irecv(&x[jlow][-nhalo], 1, ❶
horiz_type, nleft, 1002, ❶
214 MPI_COMM_WORLD, &request[2]); ❶
215 MPI_Isend(&x[jlow][isize-nhalo], 1, ❶
horiz_type, nrght, 1002, ❶
216 MPI_COMM_WORLD, &request[3]); ❶
217
218 if (corners) ❷
MPI_Waitall(4, request, status); ❷
219
220 MPI_Irecv(&x[jsize][ilow], 1, ❸
vert_type, ntop, 1003, ❸
221 MPI_COMM_WORLD, &request[ib+0]); ❸
222 MPI_Isend(&x[0 ][ilow], 1, ❸
vert_type, nbot, 1003, ❸
223 MPI_COMM_WORLD, &request[ib+1]); ❸
224
225 MPI_Irecv(&x[ -nhalo][ilow], 1, ❸
vert_type, nbot, 1004, ❸
226 MPI_COMM_WORLD, &request[ib+2]); ❸
227 MPI_Isend(&x[jsize-nhalo][ilow], 1, ❸
vert_type, ntop, 1004, ❸
228 MPI_COMM_WORLD, &request[ib+3]); ❸
229
230 MPI_Waitall(waitcount, request, status);
❶ Send left and right using the custom horiz_type MPI data type
❷ Synchronize if corners are sent.
❸ Updates ghost cells on top and bottom
The reason usually given for using MPI data types is better performance. They do allow the MPI implementation to avoid an extra copy in some cases. But from our perspective, the biggest reason for MPI data types is the cleaner, simpler code and fewer opportunities for bugs.
The 3D version using MPI data types is a little more complicated. We use MPI_Type_create_subarray in the following listing to create three custom MPI data types to be used in the communication.
Listing 8.25 Creating an MPI subarray data type for 3D ghost cells
GhostExchange/GhostExchange3D_VectorTypes/GhostExchange.cc
109 int array_sizes[] = {ksize+2*nhalo, jsize+2*nhalo, isize+2*nhalo};
110 if (corners) {
111 int subarray_starts[] = {0, 0, 0}; ❶
112 int hsubarray_sizes[] = ❶
{ksize+2*nhalo, jsize+2*nhalo, ❶
nhalo}; ❶
113 MPI_Type_create_subarray(3, ❶
array_sizes, hsubarray_sizes, ❶
114 subarray_starts, MPI_ORDER_C, ❶
MPI_DOUBLE, &horiz_type); ❶
115
116 int vsubarray_sizes[] = ❷
{ksize+2*nhalo, nhalo, ❷
isize+2*nhalo}; ❷
117 MPI_Type_create_subarray(3, ❷
array_sizes, vsubarray_sizes, ❷
118 subarray_starts, MPI_ORDER_C, ❷
MPI_DOUBLE, &vert_type); ❷
119
120 int dsubarray_sizes[] = ❸
{nhalo, jsize+2*nhalo, ❸
isize+2*nhalo}; ❸
121 MPI_Type_create_subarray(3, ❸
array_sizes, dsubarray_sizes, ❸
122 subarray_starts, MPI_ORDER_C, ❸
MPI_DOUBLE, &depth_type); ❸
123 } else {
124 int hsubarray_starts[] = {nhalo,nhalo,0}; ❶
125 int hsubarray_sizes[] = {ksize, jsize, ❶
nhalo}; ❶
126 MPI_Type_create_subarray(3, ❶
array_sizes, hsubarray_sizes, ❶
127 hsubarray_starts, MPI_ORDER_C, ❶
MPI_DOUBLE, &horiz_type); ❶
128
129 int vsubarray_starts[] = {nhalo, 0, ❷
nhalo}; ❷
130 int vsubarray_sizes[] = {ksize, nhalo, ❷
isize}; ❷
131 MPI_Type_create_subarray(3, ❷
array_sizes, vsubarray_sizes, ❷
132 vsubarray_starts, MPI_ORDER_C, ❷
MPI_DOUBLE, &vert_type); ❷
133
134 int dsubarray_starts[] = {0, nhalo, ❸
nhalo}; ❸
135 int dsubarray_sizes[] = {nhalo, ksize, ❸
isize}; ❸
136 MPI_Type_create_subarray(3, ❸
array_sizes, dsubarray_sizes, ❸
137 dsubarray_starts, MPI_ORDER_C, ❸
MPI_DOUBLE, &depth_type); ❸
138 }
139
140 MPI_Type_commit(&horiz_type);
141 MPI_Type_commit(&vert_type);
142 MPI_Type_commit(&depth_type);
❶ Creates a horizontal data type using MPI_Type_create_subarray
❷ Creates a vertical data type using MPI_Type_create_subarray
❸ Creates a depth data type using MPI_Type_create_subarray
The following listing shows that the communication routine using these three MPI data types is pretty concise.
Listing 8.26 The 3D ghost cell update using MPI data types
GhostExchange/GhostExchange3D_VectorTypes/GhostExchange.cc
334 int waitcount = 12, ib1 = 4, ib2 = 8;
335 if (corners) {
336 waitcount=4;
337 ib1 = 0, ib2 = 0;
338 }
339
340 MPI_Request request[waitcount*nhalo];
341 MPI_Status status[waitcount*nhalo];
342
343 MPI_Irecv(&x[-nhalo][-nhalo][isize], 1, ❶
horiz_type, nrght, 1001, ❶
344 MPI_COMM_WORLD, &request[0]); ❶
345 MPI_Isend(&x[-nhalo][-nhalo][0], 1, ❶
horiz_type, nleft, 1001, ❶
346 MPI_COMM_WORLD, &request[1]); ❶
347
348 MPI_Irecv(&x[-nhalo][-nhalo][-nhalo], 1, ❶
horiz_type, nleft, 1002, ❶
349 MPI_COMM_WORLD, &request[2]); ❶
350 MPI_Isend(&x[-nhalo][-nhalo][isize-1], 1, ❶
horiz_type, nrght, 1002, ❶
351 MPI_COMM_WORLD, &request[3]); ❶
352 if (corners) ❷
MPI_Waitall(4, request, status); ❷
353
354 MPI_Irecv(&x[-nhalo][jsize][-nhalo], 1, ❸
vert_type, ntop, 1003, ❸
355 MPI_COMM_WORLD, &request[ib1+0]); ❸
356 MPI_Isend(&x[-nhalo][0][-nhalo], 1, ❸
vert_type, nbot, 1003, ❸
357 MPI_COMM_WORLD, &request[ib1+1]); ❸
358
359 MPI_Irecv(&x[-nhalo][-nhalo][-nhalo], 1, ❸
vert_type, nbot, 1004, ❸
360 MPI_COMM_WORLD, &request[ib1+2]); ❸
361 MPI_Isend(&x[-nhalo][jsize-1][-nhalo], 1, ❸
vert_type, ntop, 1004, ❸
362 MPI_COMM_WORLD, &request[ib1+3]); ❸
363 if (corners) ❷
MPI_Waitall(4, request, status); ❷
364
365 MPI_Irecv(&x[ksize][-nhalo][-nhalo], 1, ❹
depth_type, nback, 1005, ❹
366 MPI_COMM_WORLD, &request[ib2+0]); ❹
367 MPI_Isend(&x[0][-nhalo][-nhalo], 1, ❹
depth_type, nfrnt, 1005, ❹
368 MPI_COMM_WORLD, &request[ib2+1]); ❹
369
370 MPI_Irecv(&x[-nhalo][-nhalo][-nhalo], 1, ❹
depth_type, nfrnt, 1006, ❹
371 MPI_COMM_WORLD, &request[ib2+2]); ❹
372 MPI_Isend(&x[ksize-1][-nhalo][-nhalo], 1, ❹
depth_type, nback, 1006, ❹
373 MPI_COMM_WORLD, &request[ib2+3]); ❹
374 MPI_Waitall(waitcount, request, status); ❹
❶ Ghost cell update for the horizontal direction.
❷ Synchronize if corners are needed in the update.
❸ Ghost cell update for the vertical direction.
❹ Ghost cell update for the depth direction.
In this section, we’ll show you how the topology functions in MPI work. The operation is still the ghost exchange shown in figure 8.5, but we can simplify the coding by using the Cartesian functions. The more general graph functions for unstructured applications are not covered. We’ll start with the setup routines before moving on to the communication routines.
The setup routines need to set the values for the process grid assignments and then to set the neighbors, as was done in listings 8.18 and 8.22. As shown in listing 8.27 for 2D and listing 8.28 for 3D, the process sets the dims array to the number of processes to use in each dimension. If any of the values in the dims array are zero, the MPI_Dims_create function calculates values that will work. Note that the number of processes in each direction does not take the mesh size into account and may not produce good values for long, narrow problems. Consider a mesh that is 8x8x1000 split across 8 processes; the process grid will be 2x2x2, resulting in a mesh domain of 4x4x500 on each process.
MPI_Cart_create takes the resulting dims array and an input array, periodic, that declares whether each boundary wraps around to the opposite side. The last argument is the reorder argument that lets MPI reorder the processes; it is zero (false) in this example. We now have a new communicator that contains information about the topology.
Getting the process grid layout is just a call to MPI_Cart_coords. Getting neighbors is done with a call to MPI_Cart_shift with the second argument specifying the direction and the third argument the displacement or number of processes in that direction. The output is the ranks of the adjacent processors.
Listing 8.27 2D Cartesian topology support in MPI
GhostExchange/CartExchange_Neighbor/CartExchange.cc
43 int dims[2] = {nprocy, nprocx};
44 int periodic[2]={0,0};
45 int coords[2];
46 MPI_Dims_create(nprocs, 2, dims);
47 MPI_Comm cart_comm;
48 MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periodic, 0, &cart_comm);
49 MPI_Cart_coords(cart_comm, rank, 2, coords);
50
51 int nleft, nrght, nbot, ntop;
52 MPI_Cart_shift(cart_comm, 1, 1, &nleft, &nrght);
53 MPI_Cart_shift(cart_comm, 0, 1, &nbot, &ntop);
The 3D Cartesian topology setup is similar but with three dimensions as the following listing shows.
Listing 8.28 3D Cartesian topology support in MPI
GhostExchange/CartExchange3D_Neighbor/CartExchange.cc
65 int dims[3] = {nprocz, nprocy, nprocx};
66 int periods[3]={0,0,0};
67 int coords[3];
68 MPI_Dims_create(nprocs, 3, dims);
69 MPI_Comm cart_comm;
70 MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart_comm);
71 MPI_Cart_coords(cart_comm, rank, 3, coords);
72 int xcoord = coords[2];
73 int ycoord = coords[1];
74 int zcoord = coords[0];
75
76 int nleft, nrght, nbot, ntop, nfrnt, nback;
77 MPI_Cart_shift(cart_comm, 2, 1, &nleft, &nrght);
78 MPI_Cart_shift(cart_comm, 1, 1, &nbot, &ntop);
79 MPI_Cart_shift(cart_comm, 0, 1, &nfrnt, &nback);
If we compare this code to the versions in listings 8.18 and 8.22, we see that the topology functions do not save many lines of code or greatly reduce the programming complexity of the setup in this relatively simple example. But we can also leverage the Cartesian communicator created in line 70 of listing 8.28 to do the neighbor communication, and that is where the greatest reduction in lines of code is seen. The MPI function has the following arguments:
int MPI_Neighbor_alltoallw(const void *sendbuf,
const int sendcounts[],
const MPI_Aint sdispls[],
const MPI_Datatype sendtypes[],
void *recvbuf,
const int recvcounts[],
const MPI_Aint rdispls[],
const MPI_Datatype recvtypes[],
MPI_Comm comm)
There are a lot of arguments in the neighbor call, but once we get these all set up, the communication is concise and done in a single statement. We’ll go over all the arguments in detail because these can be difficult to get right.
The neighbor communication call can use either a filled buffer for the sends and receives or do the operation in place. We’ll show the in-place method. The send and receive buffers are the 2D x array. We will use an MPI data type to describe the data block, so the counts will be an array with the value of one for all four Cartesian sides for 2D or six sides for 3D. The order of the communication for the sides is bottom, top, left, right for 2D and front, back, bottom, top, left, right for 3D, and is the same for both send and receive types.
The data block is different for each direction: horizontal, vertical, and depth. We use the convention of standard perspective drawings with x going to the right, y upwards, and z (the depth) going back into the page. But within each direction, the data block is the same but with different displacements to the start of the data block. The displacements are in bytes, which is why you will see the offsets multiplied by 8, the data type size of a double-precision value. Now let’s look at how all this gets put into code for the setup of the communication for the 2D case in the following listing.
Listing 8.29 2D Cartesian neighbor communication setup
GhostExchange/CartExchange_Neighbor/CartExchange.c
55 int ibegin = imax *(coords[1]  )/dims[1]; ❶
56 int iend   = imax *(coords[1]+1)/dims[1]; ❶
57 int isize  = iend - ibegin; ❶
58 int jbegin = jmax *(coords[0]  )/dims[0]; ❶
59 int jend   = jmax *(coords[0]+1)/dims[0]; ❶
60 int jsize  = jend - jbegin; ❶
61
62 int jlow=nhalo, jhgh=jsize+nhalo, ilow=nhalo, inum = isize; ❷
63 if (corners) { ❷
64    int ilow = 0, inum = isize+2*nhalo; ❷
65    if (nbot == MPI_PROC_NULL) jlow = 0; ❷
66    if (ntop == MPI_PROC_NULL) jhgh = jsize+2*nhalo; ❷
67 } ❷
68 int jnum = jhgh-jlow; ❷
69
70 int array_sizes[] = {jsize+2*nhalo, isize+2*nhalo};
71
72 int subarray_sizes_x[] = {jnum, nhalo}; ❸
73 int subarray_horiz_start[] = {jlow, 0}; ❸
74 MPI_Datatype horiz_type; ❸
75 MPI_Type_create_subarray (2, array_sizes, subarray_sizes_x, subarray_horiz_start, ❸
76    MPI_ORDER_C, MPI_DOUBLE, &horiz_type); ❸
77 MPI_Type_commit(&horiz_type); ❸
78
79 int subarray_sizes_y[] = {nhalo, inum}; ❹
80 int subarray_vert_start[] = {0, jlow}; ❹
81 MPI_Datatype vert_type; ❹
82 MPI_Type_create_subarray (2, array_sizes, subarray_sizes_y, subarray_vert_start, ❹
83    MPI_ORDER_C, MPI_DOUBLE, &vert_type); ❹
84 MPI_Type_commit(&vert_type); ❹
85
86 MPI_Aint sdispls[4] = { nhalo *(isize+2*nhalo)*8, ❺❻
87                         jsize *(isize+2*nhalo)*8, ❼
88                         nhalo *8, ❽
89                         isize *8}; ❾
90 MPI_Aint rdispls[4] = { 0, ❿
91                         (jsize+nhalo) *(isize+2*nhalo)*8, ⓫
92                         0, ⓬
93                         (isize+nhalo)*8}; ⓭
94 MPI_Datatype sendtypes[4] = {vert_type, vert_type, horiz_type, horiz_type}; ⓮
95 MPI_Datatype recvtypes[4] = {vert_type, vert_type, horiz_type, horiz_type}; ⓯
❶ Calculates the global begin and end indices and the local array size
❷ Includes the corner values if these are requested
❸ Creates the data block to communicate in the horizontal direction using the subarray function
❹ Creates the data block to communicate in the vertical direction using the subarray function
❺ Bottom row is nhalo above start.
❻ Displacements are from bottom left corner of memory block in bytes.
❼ Top row is jsize above start.
❽ Left column is nhalo right of start.
❾ Right column is isize right of start.
❿ Bottom ghost row is 0 above start.
⓫ Top ghost row is jsize+nhalo above start.
⓬ Left ghost column is 0 right of start.
⓭ Right ghost column is isize+nhalo right of start.
⓮ Send types are ordered bottom, top, left, and right neighbors.
⓯ Receive types are ordered bottom, top, left, and right neighbors.
The setup for the 3D Cartesian neighbor communication uses the MPI data types from listing 8.25. The data types define the block of data to be moved, but we need to define the offset in bytes to the start location of the data block for the send and receive. We also need to define the arrays for the sendtypes and recvtypes in the proper order as in the next listing.
Listing 8.30 3D Cartesian neighbor communication setup
GhostExchange/CartExchange3D_Neighbor/CartExchange.c

int xyplane_mult = (jsize+2*nhalo)*(isize+2*nhalo)*8;
int xstride_mult = (isize+2*nhalo)*8;
MPI_Aint sdispls[6] = { nhalo *xyplane_mult,           ❶❷
                        ksize *xyplane_mult,           ❸
                        nhalo *xstride_mult,           ❹
                        jsize *xstride_mult,           ❺
                        nhalo *8,                      ❻
                        isize *8};                     ❼
MPI_Aint rdispls[6] = { 0,                             ❽
                        (ksize+nhalo) *xyplane_mult,   ❾
                        0,                             ❿
                        (jsize+nhalo) *xstride_mult,   ⓫
                        0,                             ⓬
                        (isize+nhalo)*8};              ⓭
MPI_Datatype sendtypes[6] = {depth_type, depth_type,   ⓮
                             vert_type, vert_type,
                             horiz_type, horiz_type};
MPI_Datatype recvtypes[6] = {depth_type, depth_type,   ⓮
                             vert_type, vert_type,
                             horiz_type, horiz_type};
❶ Front is nhalo behind front.
❷ Displacements are from bottom left corner of memory block in bytes.
❸ Back is ksize behind front.
❹ Bottom row is nhalo above start.
❺ Top row is jsize above start.
❻ Left column is nhalo right of start.
❼ Right column is isize right of start.
❽ Front ghost is 0 from front.
❾ Back ghost is ksize+nhalo behind front.
❿ Bottom ghost row is 0 above start.
⓫ Top ghost row is jsize+nhalo above start.
⓬ Left ghost column is 0 right of start.
⓭ Right ghost column is isize+nhalo right of start.
⓮ Send and receive types are ordered front, back, bottom, top, left, and right.
The actual communication is done with a single call to MPI_Neighbor_alltoallw, as shown in listing 8.31. There is also a second block of code for the corners case, which requires a couple of calls with a synchronization in between to ensure that the corners are properly filled. The first call does only the horizontal direction and then waits for completion before doing the vertical direction.
Listing 8.31 2D Cartesian neighbor communication
GhostExchange/CartExchange_Neighbor/CartExchange.c

if (corners) {
   int counts1[4] = {0, 0, 1, 1};                           ❶
   MPI_Neighbor_alltoallw(                                  ❷
        &x[-nhalo][-nhalo], counts1, sdispls, sendtypes,
        &x[-nhalo][-nhalo], counts1, rdispls, recvtypes,
        cart_comm);

   int counts2[4] = {1, 1, 0, 0};                           ❸
   MPI_Neighbor_alltoallw(                                  ❹
        &x[-nhalo][-nhalo], counts2, sdispls, sendtypes,
        &x[-nhalo][-nhalo], counts2, rdispls, recvtypes,
        cart_comm);
} else {
   int counts[4] = {1, 1, 1, 1};                            ❺
   MPI_Neighbor_alltoallw(                                  ❻
        &x[-nhalo][-nhalo], counts, sdispls, sendtypes,
        &x[-nhalo][-nhalo], counts, rdispls, recvtypes,
        cart_comm);
}
❶ Sets counts to 1 for only the horizontal (left and right) neighbors
❷ First call exchanges only the horizontal ghost cells.
❸ Sets counts to 1 for only the vertical (bottom and top) neighbors
❹ Second call exchanges the vertical ghost cells, including the corner data.
❺ Sets all the counts to 1 for all the directions
❻ All the neighbor communication is done in one call.
The 3D Cartesian neighbor communication is similar but with the addition of the z coordinate (depth). The depth comes first in the counts and types arrays. In the phased communication for corners, the depth comes after horizontal and vertical ghost cell exchanges as the next listing shows.
Listing 8.32 3D Cartesian neighbor communication
GhostExchange/CartExchange3D_Neighbor/CartExchange.c

if (corners) {
   int counts1[6] = {0, 0, 0, 0, 1, 1};                          ❶
   MPI_Neighbor_alltoallw(                                       ❶
        &x[-nhalo][-nhalo][-nhalo], counts1, sdispls, sendtypes,
        &x[-nhalo][-nhalo][-nhalo], counts1, rdispls, recvtypes,
        cart_comm);

   int counts2[6] = {0, 0, 1, 1, 0, 0};                          ❷
   MPI_Neighbor_alltoallw(                                       ❷
        &x[-nhalo][-nhalo][-nhalo], counts2, sdispls, sendtypes,
        &x[-nhalo][-nhalo][-nhalo], counts2, rdispls, recvtypes,
        cart_comm);

   int counts3[6] = {1, 1, 0, 0, 0, 0};                          ❸
   MPI_Neighbor_alltoallw(                                       ❸
        &x[-nhalo][-nhalo][-nhalo], counts3, sdispls, sendtypes,
        &x[-nhalo][-nhalo][-nhalo], counts3, rdispls, recvtypes,
        cart_comm);
} else {
   int counts[6] = {1, 1, 1, 1, 1, 1};                           ❹
   MPI_Neighbor_alltoallw(                                       ❹
        &x[-nhalo][-nhalo][-nhalo], counts, sdispls, sendtypes,
        &x[-nhalo][-nhalo][-nhalo], counts, rdispls, recvtypes,
        cart_comm);
}

❶ First call exchanges only the horizontal (left and right) ghost cells.
❷ Second call exchanges the vertical (bottom and top) ghost cells.
❸ Third call exchanges the depth (front and back) ghost cells.
❹ Without corners, all the neighbor communication is done in one call.
Let’s try out these ghost cell exchange variants on a test system. We’ll use two Broadwell nodes (Intel® Xeon® CPU E5-2695 v4 at 2.10GHz) with 72 virtual cores each. We could run this on more compute nodes with different MPI library implementations, halo sizes, mesh sizes, and with higher performance communication interconnects for a more comprehensive view of how each ghost cell exchange variant performs. Here’s the code:
mpirun -n 144 --bind-to hwthread ./GhostExchange -x 12 -y 12 -i 20000 \
    -j 20000 -h 2 -t -c
mpirun -n 144 --bind-to hwthread ./GhostExchange -x 6 -y 4 -z 6 -i 700 \
    -j 700 -k 700 -h 2 -t -c
The options to the GhostExchange program are
The combination of two or more parallelization techniques is called hybrid parallelization, in contrast to an all-MPI implementation, also called pure MPI or MPI-everywhere. In this section, we'll look at hybrid MPI plus OpenMP, where MPI and OpenMP are used together in an application. This usually amounts to replacing some MPI ranks with OpenMP threads. For larger parallel applications reaching into thousands of processes, replacing MPI ranks with OpenMP threads potentially reduces the total size of the MPI domain and the memory needed at extreme scale. However, the added performance of the thread-level parallelism layer might not always be worth the added complexity and development time. For this reason, hybrid MPI plus OpenMP implementations are normally the domain of extreme applications in both size and performance needs.
When performance becomes critical enough for the added complexity of hybrid parallelism, there can be several advantages of adding an OpenMP parallel layer to MPI-based code. For example, these advantages might be
Spatially decomposed parallel applications using subdomains with ghost (halo) cells have fewer total ghost cells per node when you add thread-level parallelism. This leads to a reduction in both memory requirements and communication costs, especially on a many-core architecture like Intel's Knights Landing (KNL). Using shared-memory parallelism can also improve performance by reducing contention for the network interface card (NIC) and by avoiding the unnecessary copying of data that MPI uses for on-node messages. Additionally, many MPI collective algorithms are tree-based, scaling as log2 n; replacing ranks with threads reduces n, which decreases the depth of the tree and incrementally improves performance. The remaining work still has to be done by the threads, but it incurs lower synchronization and communication latency costs. Threads can also be used to improve load balance within a NUMA region or a compute node.
In some cases, a hybrid parallel approach is not only advantageous, but necessary to access the full hardware performance potential. For example, some hardware, and perhaps memory controller functionality, can only be accessed by threads and not by processes (MPI ranks). Intel's Knights Corner and Knights Landing many-core architectures have had these concerns. In MPI + X + Y, where X is threading and Y is a GPU language, we often match the number of ranks to the number of GPUs. OpenMP then allows the application to continue to use the other processors for on-CPU work. There are other solutions to this, such as MPI communicator groups and MPI shared-memory functionality, or simply driving the GPU from multiple MPI ranks.
In summary, while it can be attractive to run codes with MPI-everywhere on modern many-core systems, there are concerns about scalability as the number of cores grows. If you are looking for extreme scalability, you will want an efficient implementation of OpenMP in your application. We covered our more efficient high-level OpenMP design in sections 7.2.2 and 7.6 of the previous chapter.
The first step in a hybrid MPI plus OpenMP implementation is to let MPI know what you will be doing. This is done in the MPI_Init call right at the beginning of the program. You should replace the MPI_Init call with an MPI_Init_thread call like this:
MPI_Init_thread(&argc, &argv, int thread_model_required,
                int *thread_model_provided);
The MPI standard defines four thread models. These models give different levels of thread safety with the MPI calls. In increasing order of thread safety:
MPI_THREAD_SINGLE—Only one thread is executed (standard MPI)
MPI_THREAD_FUNNELED—Multithreaded but only the main thread makes MPI calls
MPI_THREAD_SERIALIZED—Multithreaded but only one thread at a time makes MPI calls
MPI_THREAD_MULTIPLE—Multithreaded with multiple threads making MPI calls
Many applications perform communication at the main loop level, and OpenMP threads are applied to key computational loops. For this pattern, MPI_THREAD_FUNNELED works just fine.
Note It’s best to use the lowest level of thread safety that you need. Each higher level imposes a performance penalty because the MPI library has to place mutexes or critical blocks around send and receive queues and other basic parts of MPI.
Now let’s see what changes are needed to our stencil example to add OpenMP threading. We chose the CartExchange_Neighbor example to modify for this exercise. The following listing shows that the first change is to modify the MPI initialization.
Listing 8.33 MPI initialization for OpenMP threading
HybridMPIPlusOpenMP/CartExchange.cc

26 int provided;
27 MPI_Init_thread(&argc, &argv,                       ❶
         MPI_THREAD_FUNNELED, &provided);              ❶
28
29 int rank, nprocs;
30 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
31 MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
32 if (rank == 0) {
33    #pragma omp parallel
34    #pragma omp master
35    printf("requesting MPI_THREAD_FUNNELED"          ❷
             " with %d threads\n",
36           omp_get_num_threads());                   ❷
37    if (provided != MPI_THREAD_FUNNELED){            ❸
38       printf("Error: MPI_THREAD_FUNNELED"
                " not available. Aborting ...\n");
39       MPI_Finalize();
40       exit(0);
41    }
42 }
❶ MPI initialization for OpenMP threading
❷ Prints the number of threads to check that it is what we want
❸ Checks if this MPI supports our requested thread safety level
The mandatory change is using MPI_Init_thread instead of MPI_Init on line 27. The additional code checks that the requested thread safety level is available and exits if it is not. We also print the number of threads from the main thread of rank 0. Now on to the changes in the computational loop shown in the next listing.
Listing 8.34 Addition of OpenMP threading and vectorization to computational loops
HybridMPIPlusOpenMP/CartExchange.cc

157 #pragma omp parallel for                           ❶
158 for (int j = 0; j < jsize; j++){
159    #pragma omp simd                                ❷
160    for (int i = 0; i < isize; i++){
161       xnew[j][i] = ( x[j][i] + x[j][i-1] + x[j][i+1] +
                         x[j-1][i] + x[j+1][i] )/5.0;
162    }
163 }
❶ Adds OpenMP threading for outer loop
❷ Adds SIMD vectorization for inner loop
The changes required to add OpenMP threading are the addition of a single pragma at line 157. As a bonus, we show how to add vectorization for the inner loop with another pragma inserted at line 159.
You can now try running this hybrid MPI plus OpenMP plus vectorization example on your system. But to get good performance, you will need to control the placement of the MPI ranks and the OpenMP threads. This is done by setting affinity, a topic that we cover in greater depth in chapter 14.
Definition Affinity assigns a preference for the scheduling of a process, rank, or thread to a particular hardware component. This is also called pinning or binding.
Setting the affinity for your ranks and threads becomes more important as the complexity of the node increases and with hybrid parallel applications. In earlier examples, we used --bind-to core and --bind-to hwthread to improve performance and reduce variability in run-time performance caused by ranks migrating from one core to another. In OpenMP, we use environment variables to set placement and affinity. An example is
export OMP_PLACES=cores
export OMP_PROC_BIND=true
For now, start by pinning the MPI ranks to sockets so that the threads can spread to other cores, as we showed in our ghost cell test example for the Skylake Gold processor. Here's how:
export OMP_NUM_THREADS=22
mpirun -n 4 --bind-to socket ./CartExchange -x 2 -y 2 -i 20000 -j 20000 \
    -h 2 -t -c
We run 4 MPI ranks that each spawn 22 threads, as specified by the OMP_NUM_THREADS environment variable, for a total of 88 threads of execution. The --bind-to socket option tells mpirun to bind each process to the socket where it is placed.
Although we have covered a lot of material in this chapter, there are still many more features that are worth exploring as you get more experience with MPI. Some of the most important are mentioned here and left for your own study.
Comm groups—MPI has a rich set of functions that create, split, and otherwise manipulate the standard MPI_COMM_WORLD communicator into new groupings for specialized operations, such as communication within a row or task-based subgroups. For some examples of the use of communicator groups, see listing 16.4 in section 16.3. We use communicator groups to split the file output into multiple files and to break the domain into row and column communicators.
Unstructured mesh boundary communications—An unstructured mesh needs to exchange boundary data in a manner similar to that covered for a regular Cartesian mesh. These operations are more complex and are not covered here. There are many sparse, graph-based communication libraries that support unstructured mesh applications. One example of such a library is the L7 communication library developed by Richard Barrett, now at Sandia National Laboratories. It is included with the CLAMR mini-app; see the l7 subdirectory at https://github.com/LANL/CLAMR.
Shared memory—The original MPI implementations sent data over the network interface in nearly all cases. As the number of cores grew, MPI developers realized that they could do some of the communication in shared memory. This is done behind the scenes as a communication optimization. Additional shared memory functionality continues to be added with MPI shared memory “windows.” This functionality had some problems at first, but it is becoming mature enough to use in applications.
One-sided communication—Responding to other programming models, MPI added one-sided communication in the form of MPI_Put and MPI_Get operations. Contrary to the original MPI message-passing model, where both the sender and receiver have to be active participants, the one-sided model allows just one or the other to conduct the operation.
If you want more introductory material on MPI, the text by Peter Pacheco is a classic:
Peter Pacheco, An Introduction to Parallel Programming (Elsevier, 2011).
You can find thorough coverage of MPI authored by members of the original MPI development team:
William Gropp, et al., Using MPI: Portable Parallel Programming with the Message-Passing Interface, Vol. 1 (MIT Press, 1999).
For a presentation of MPI plus OpenMP, there is a good lecture from a course by Bill Gropp, one of the developers of the original MPI standard. Here’s the link:
http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture36.pdf
Why can’t we just block on receives as was done in the send/receive in the ghost exchange using the pack or array buffer methods in listings 8.20 and 8.21, respectively?
Is it safe to block on receives as shown in listing 8.8 in the vector type version of the ghost exchange? What are the advantages if we only block on receives?
Modify the ghost cell exchange vector type example in listing 8.21 to use blocking receives instead of a waitall. Is it faster? Does it always work?
Try replacing the explicit tags in one of the ghost exchange routines with MPI_ANY_TAG. Does it work? Is it any faster? What advantage do you see in using explicit tags?
Remove the barriers for the synchronized timers in one of the ghost exchange examples. Run the code with the original synchronized timers and the unsynchronized timers.
Add the timer statistics from listing 8.11 to the stream triad bandwidth measurement code in listing 8.17.
Apply the steps to convert high-level OpenMP to the hybrid MPI plus OpenMP example in the code that accompanies the chapter (HybridMPIPlusOpenMP directory). Experiment with the vectorization, number of threads, and MPI ranks on your platform.
Use the proper send and receive point-to-point messages. This avoids hangs and gets good performance.
Use collective communication for common operations. This makes for concise programming, avoids hangs, and improves performance.
Use ghost exchanges to link together subdomains from various processors. The exchanges make the subdomains act as a single global computational mesh.
Add more levels of parallelism through combining MPI with OpenMP threads and vectorization. The additional parallelism helps give better performance.
The following chapters on GPU computing discuss using GPUs for scientific computing. The topics include
In chapter 9, you’ll gain an understanding of the GPU architecture and its benefits for general-purpose computation.
In chapter 10, you’ll learn how to build a mental representation of the programming model for GPUs.
In chapters 11 and 12, you’ll explore the available GPU programming languages. In chapter 11, we present basic examples in OpenACC and OpenMP, and in chapter 12, we cover a broad range of GPU languages, from lower level native languages like CUDA, OpenCL, and HIP to higher level ones like SYCL, Kokkos, and Raja.
In chapter 13, you’ll learn about profiling tools and developing a workflow model that enhances programmer productivity.
GPUs were built to accelerate computation. With a single-minded focus on improving the frame rate for computer animation, GPU hardware developers went to great lengths to increase numerical operation throughput. These devices are simply general-purpose extreme accelerators for any massively parallel operation. They are called graphics processing units because they were developed for that application.
Fast forward to today, and many software developers have realized that the acceleration provided by GPUs is just as applicable for a wide variety of application domains. While at the University of North Carolina in 2002, Mark Harris coined the term general-purpose graphics processing units (GPGPUs) to try and capture the idea that GPUs are suitable for more than graphics alone. Major markets for GPUs have sprung up for bitcoin mining, machine learning, and high-performance computing. Small modifications to the GPU hardware, such as double-precision floating-point units and tensor operations, have customized the basic GPU designs for each of these markets. No longer are GPUs just for graphics.
Each GPU model that is released targets a different market segment. With high-end GPU models commanding prices of up to $10,000, these are not the same hardware models that you will find in mass-market computers. Though GPUs are used in many more applications than they were originally intended for, it is still difficult to see them completely replacing the general-purpose functionality of a CPU in the near future, because single operations are better suited to the CPU.
If we were to come up with a new name for GPUs, how would we capture their functionality in a broad sense? The commonality across all the use domains is that, in order for it to make sense to use GPUs, there must be a lot of work that can be done simultaneously. And by a lot, we mean thousands or tens of thousands of simultaneous parallel operations. GPUs are really parallel accelerators. Maybe we should call them Parallel Processing Units (PPUs) or Parallel Processing Accelerators (PPAs) to capture a better idea of their functionality. But, we’ll stick with the term GPUs with the understanding that they are so much more. Seeing GPUs in this light, you can understand why they are of such importance for the parallel computing community.
GPUs are simpler devices to design and manufacture than CPUs, so the design cycle time is half that of CPUs. The crossover point in GPU performance relative to CPUs for many applications was about 2012. Since then, GPU performance has been improving at about twice the rate of CPU performance. In very rough numbers, GPUs today can provide a ten times speedup over CPUs. Of course, there is a lot of variability in this speedup with the type of application and the quality of the code implementation. The trends are clear: GPUs will continue to show greater speedups for those applications that fit their massively parallel architecture.
To help you understand these new hardware devices, we go over the essential parts of their hardware design in chapter 9. We then try to help you develop a mental model of how to approach them. It is important to gain this understanding before tackling a project for GPUs. We have seen numerous porting efforts fail because programmers thought they could just move the most expensive loop to the GPU and would see fantastic speedups. Then, when their application runs slower, they abandon the effort. Transferring data to the GPU is expensive; therefore, large parts of your application must be ported to the device to see any benefit. A simple performance model and analysis before the GPU implementation would have tempered programmers' initial expectations and cautioned them to plan for sufficient time and effort to achieve success.
Perhaps the biggest impediment to beginning a GPU implementation is the constantly shifting landscape of programming languages. It seems like a new language gets released every few months. Although these languages bring a high degree of innovation, this constant evolution makes it difficult for application developers. However, taking a closer look at the languages shows that there are more similarities than differences, often converging on a couple of common designs. While we expect a few more years of language thrashing, many of the language variations are akin to dialects rather than completely different languages. In chapter 11, we cover the pragma-based languages. In chapter 12, we survey native GPU languages and a new class of performance portability languages that heavily leverage C++ constructs. Though we present a variety of language implementations, we suggest you initially pick a couple of languages to get some hands-on experience with those. Much of your decision on which languages to experiment with will be dependent on the hardware that you have readily available.
You can check out the examples at https://github.com/EssentialsofParallelComputing for each of these chapters. (In fact, we highly encourage you to do so.) One of the barriers to GPU programming is getting access to hardware and setting it up properly. The installation of system software to support the GPUs can sometimes be difficult. The examples include lists of the software packages for the GPUs from the vendors that can get you started. But you will want to install the software for the GPU in your system and, in the examples, comment out the rest. It will take some trial and error. In chapter 13, we discuss different workflows and alternatives, such as setting up Docker containers and virtual machines (VMs). These options may provide a way to set up a development environment on your laptop or desktop, especially if you are using Windows or macOS.
If you don’t have local hardware available, you can try out a cloud service with GPUs. Some, such as Google Cloud ($200-$300 credit), even have free trials. These services also have marketplace add-ons that let you set up an HPC cluster with GPUs. One example of an HPC cloud service with GPUs is the Fluid Numerics Google Cloud Platform. Intel has likewise set up a cloud service for trying out Intel GPUs. For more information, we recommend these sites:
Fluid-Slurm Google Cloud Cluster at https://console.cloud.google.com/marketplace/details/fluid-cluster-ops/fluid-slurm-gcp
Intel cloud version of oneAPI and DPCPP at https://software.intel.com/en-us/oneapi (you must register to use it).
Why do we care about graphics processing units (GPUs) for high-performance computing? GPUs provide a massive source of parallel operations that can greatly exceed what is available on more conventional CPU architectures. To exploit their capabilities, it is essential that we understand GPU architectures. Though GPUs have often been used for graphics processing, they are also used for general-purpose parallel computing. This chapter provides an overview of the hardware on a GPU-accelerated platform.
What systems today are GPU accelerated? Virtually every computing system provides the powerful graphics capabilities expected by today’s users. These GPUs range from small components of the main CPU to large peripheral cards taking up much of the space in a desktop case. HPC systems increasingly come equipped with multiple GPUs. Even personal computers used for simulation or gaming sometimes connect two GPUs for higher graphics performance. In this chapter, we present a conceptual model that identifies the key hardware components of a GPU-accelerated system. Figure 9.1 shows these components.
Figure 9.1 Block diagram of GPU-accelerated system using a dedicated GPU. The CPU and GPU each have their own memory. The CPU and GPU communicate over a PCI bus.
Due to inconsistent terminology in the community, there is added complexity in understanding GPUs. We will use the terminology established by the OpenCL standard because it was agreed to by multiple GPU vendors. We will also note alternate terminology that is in common use, such as that used by NVIDIA. Let’s look at a few definitions before continuing our discussion:
CPU—The main processor that is installed in the socket of the motherboard.
CPU RAM—The “memory sticks” or dual in-line memory modules (DIMMs) containing Dynamic Random-Access Memory (DRAM) that are inserted into the memory slots in the motherboard.
GPU—A large peripheral card installed in a Peripheral Component Interconnect Express (PCIe) slot on the motherboard.
GPU RAM—Memory modules on the GPU peripheral card for exclusive use of the GPU.
PCI bus—The wiring that connects the peripheral cards to the other components on the motherboard.
We’ll introduce each component in a GPU-accelerated system and show how to calculate the theoretical performance for each. We’ll then examine their actual performance with small micro-benchmark applications. This will help to establish how some hardware components can cause bottlenecks that prevent you from accelerating an application with GPUs. Armed with this information, we’ll conclude the chapter with a discussion of the types of applications that benefit most from GPU acceleration and what your goals should be to see performance gains when porting an application to run on GPUs. For this chapter, you’ll find the source code at https://github.com/EssentialsofParallelComputing/Chapter9.
GPUs are everywhere. They can be found in cell phones, tablets, personal computers, consumer-grade workstations, gaming consoles, high performance computing centers, and cloud computing platforms. GPUs provide additional compute power on most modern hardware and accelerate many operations you may not even be aware of. As the name suggests, GPUs were designed for graphics-related computations. Consequently, GPU design focuses on processing large blocks of data (triangles or polygons) in parallel, which is a requirement for graphics applications. Compared to CPUs that can handle tens of parallel threads or processes in a clock cycle, GPUs are capable of processing thousands of parallel threads simultaneously. Because of this design, GPUs offer a considerably higher theoretical peak performance that can potentially reduce the time to solution and the energy footprint of an application.
Computational scientists, always on the lookout for computational horsepower, were attracted to using GPUs to perform more general-purpose computing tasks. Because GPUs were designed for graphics, the languages originally developed to program them, like OpenGL, focused on graphics operations. To implement algorithms on GPUs, programmers had to reframe their algorithms in terms of these operations, which was time-consuming and error-prone. Extending the use of the graphics processor to non-graphics workloads became known as general-purpose graphics processing unit (GPGPU) computing.
The continued interest and success of GPGPU computing led to the introduction of a flurry of GPGPU languages. The first to gain wide adoption was the Compute Unified Device Architecture (CUDA) programming language for NVIDIA GPUs, which was first introduced in 2007. The dominant open standard GPGPU computing language is the Open Computing Language (OpenCL), developed by a group of vendors led by Apple and released in 2009. We’ll cover both CUDA and OpenCL in chapter 12.
Despite the continual introduction of GPGPU languages, or maybe because of it, many computational scientists have found the original, native, GPGPU languages difficult to use. As a result, higher-level approaches using directive-based APIs gained a large following and spurred corresponding development efforts by vendors. We’ll cover examples of directive-based languages like OpenACC and OpenMP (with the new target directive) in chapter 11. For now, we summarize the new directive-based GPGPU languages, OpenACC and OpenMP, as an unqualified success. These languages and APIs have allowed programmers to focus more on developing their applications, rather than expressing their algorithm in terms of graphics operations. The end result has often been tremendous speedups in scientific and data science applications.
GPUs are best described as accelerators, a class of devices long used in the computing world. First, let’s define what we mean by an accelerator.
Definition An accelerator (hardware) is a special-purpose device that supplements the main general-purpose CPU in speeding up certain operations.
A classic example of an accelerator is the original PC that came with the 8088 CPU. It had the option and a socket for the 8087 coprocessor that would do floating-point operations in hardware rather than software. Today, the most common hardware accelerator is the graphics processor, which can be either a separate hardware component or integrated on the main processor. The distinction of being called an accelerator is that it is a special-purpose rather than a general-purpose device, but that difference is not always clear-cut. A GPU is an additional hardware component that can perform operations alongside a CPU. GPUs come in two flavors:
Integrated GPUs—A graphics processor engine that is contained on the CPU
Dedicated GPUs—A GPU contained on a separate peripheral card
Integrated GPUs are built directly into the CPU chip. Integrated GPUs share RAM resources with the CPU. Dedicated GPUs are attached to the motherboard via a Peripheral Component Interconnect (PCI) slot. The PCI slot is a physical component that allows data to be transmitted between the CPU and GPU. It is commonly referred to as the PCI bus.
Intel® has long included an integrated GPU with their CPUs for the budget market. They fully expected that users wanting real performance would buy a discrete GPU. The Intel integrated GPUs have historically been relatively weak in comparison to AMD’s (Advanced Micro Devices, Inc.) integrated version. This has recently changed with Intel claiming that the integrated graphics on their Ice Lake processor are on a par with AMD integrated GPUs.
The AMD integrated GPUs are called Accelerated Processing Units (APUs). These are a tightly coupled combination of the CPU and a GPU. The source of the GPU design originally came from the AMD purchase of the ATI graphics card company in 2006. In the AMD APU, the CPU and GPU share the same processor memory. These GPUs are smaller than a discrete GPU, but still (proportionally) give GPU graphics (and compute) performance. The real target for AMD for APUs is to provide a more cost-effective, but performant system for the mass market. The shared memory is also attractive because it eliminates the data transfer over the PCI bus, which is often a serious performance bottleneck.
The ubiquitous nature of the integrated GPU is important. For us, it means that now many commodity desktops and laptops have the ability to accelerate computations. The goal on these systems is a relatively modest performance boost and, perhaps, to reduce the energy cost or to improve battery life. But for extreme performance, the discrete GPUs are still the undisputed performance champions.
In this chapter, we will focus primarily on GPU-accelerated platforms with dedicated GPUs, also called discrete GPUs. Dedicated GPUs generally offer more compute power than integrated GPUs. Additionally, these GPUs can be isolated to execute general-purpose computing tasks. Figure 9.1 conceptually illustrates a CPU-GPU system with a dedicated GPU. A CPU has access to its own memory space (CPU RAM) and is connected to a GPU via a PCI bus. It is able to send data and instructions over the PCI bus for the GPU to work with. The GPU has its own memory space, separate from the CPU memory space.
In order for work to be executed on the GPU, at some point data must be transferred from the CPU to the GPU. When the work is complete and the results are going to be written to file, the GPU must send data back to the CPU. The instructions the GPU must execute are also sent from the CPU to the GPU. Each of these transactions is mediated by the PCI bus. Although we won’t discuss how to make these actions happen in this chapter, we’ll discuss the hardware performance limitations of the PCI bus. Due to these limitations, a poorly designed GPU application can have worse performance than CPU-only code. We’ll also discuss the internal architecture of the GPU and its performance with regard to memory and floating-point operations.
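To get a feel for why the PCI bus matters, a back-of-the-envelope sketch of the transfer cost helps. The PCIe generation, lane count, and encoding overhead below are assumptions chosen for illustration (PCIe 3.0 x16 with 128b/130b encoding), not values from the text:

```python
# Back-of-the-envelope PCI bus transfer cost (a sketch; the PCIe
# generation and lane count are illustrative assumptions).

def pcie_bandwidth_gb_s(gt_per_s, lanes, encoding_efficiency):
    """Theoretical one-way PCIe bandwidth in GB/s."""
    return gt_per_s * lanes * encoding_efficiency / 8.0

# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding
bw = pcie_bandwidth_gb_s(8.0, 16, 128.0 / 130.0)   # roughly 15.75 GB/s

# Time just to ship a 1 GB array to the GPU, before any compute happens:
transfer_s = 1.0 / bw
print(f"PCIe 3.0 x16: {bw:.2f} GB/s, 1 GB transfer ~ {transfer_s*1e3:.0f} ms")
```

Tens of milliseconds per gigabyte is enormous next to GPU compute rates, which is why enough of the application must live on the device to amortize the transfers.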
For those of us who have done thread programming over the years on a CPU, the graphics processor is like the ideal thread engine. The components of this thread engine are
Let’s look at the hardware architecture of a GPU to get an idea of how it performs this magic. To show a conceptual model of a GPU, we abstract the common elements from different GPU vendors and even between design variations from the same vendor. We must remind you that there are hardware variations that are not captured by these abstract models. Add to this the plethora of terminology currently in use, and it is not surprising that newcomers to the field find GPU hardware and programming languages difficult to understand. Still, this terminology is relatively sane compared to the graphics world with its vertex shaders, texture mapping units, and fragment generators. Table 9.1 summarizes the rough equivalence of terminology, but beware that because the hardware architectures are not exactly the same, the correspondence in terminology varies depending on the context and user.
Table 9.1 Hardware terminology: A rough translation
The last row in table 9.1 shows the hardware layer that implements a single instruction on multiple data, commonly referred to as SIMD. Strictly speaking, the NVIDIA hardware does not have vector hardware, or SIMD, but emulates this through a collection of threads in what it calls a warp in a single instruction, multi-thread (SIMT) model. You may want to refer back to our initial discussion of parallel categories in section 1.4 to refresh your memory on these different approaches. Other GPUs can also perform SIMT operations on what OpenCL and AMD call subgroups, which are equivalent to the NVIDIA warps. We’ll discuss this more in chapter 10, which explicitly looks at GPU programming models. This chapter, however, will focus on the GPU hardware, its architecture, and concepts.
Often, GPUs also have hardware blocks of replication, some of which are listed in table 9.2, to simplify the scaling of their hardware designs to more units. These units of replication are a manufacturing convenience, but often show up in the specification lists and discussions.
Table 9.2 GPU hardware replication units by vendor
Figure 9.2 depicts a simplified block diagram of a single node system with a single multiprocessor CPU and two GPUs. A single node can have a wide variety of configurations, composed of one or more multiprocessor CPUs with an integrated GPU, and from one to six discrete GPUs. In OpenCL nomenclature, each GPU is a compute device. But compute devices can also be CPUs in OpenCL.
Definition A compute device in OpenCL is any computational hardware that can perform computation and supports OpenCL. This can include GPUs, CPUs, or even more exotic hardware such as embedded processors or field-programmable gate arrays (FPGAs).
Figure 9.2 A simplified block diagram of a GPU system showing two compute devices, each having separate GPU, GPU memory, and multiple compute units (CUs). The NVIDIA CUDA terminology refers to CUs as streaming multiprocessors (SMs).
The simplified diagram in figure 9.2 is our model for describing the components of a GPU and is also useful when understanding how a GPU processes data. A GPU is composed of
CUs have their own internal architecture, often referred to as the microarchitecture. Instructions and data received from the CPU are processed by the workload distributor. The distributor coordinates instruction execution and data movement onto and off of the CUs. The achievable performance of a GPU depends on
In this section, we’ll explore each of the components for our model of a GPU. With each component, we will also discuss models for theoretical peak bandwidth. Additionally, we’ll show how to use micro-benchmark tools to measure actual performance of components.
A GPU compute device has multiple CUs. (CU, compute unit, is the term agreed to by the community for the OpenCL standard.) NVIDIA calls them streaming multiprocessors (SMs), and Intel refers to them as subslices.
Each CU contains multiple graphics processors called processing elements (PEs) in OpenCL, or CUDA cores (or Compute Cores) as NVIDIA calls them. Intel refers to them as execution units (EUs), and the graphics community calls them shader processors.
Figure 9.3 shows a simplified conceptual diagram of a PE. These processors are not equivalent to a CPU processor; they are simpler designs that only need to perform graphics operations. But the operations needed for graphics include nearly all the arithmetic operations that a programmer uses on a regular processor.
Figure 9.3 Simplified block diagram of a compute unit (CU) with a large number of processing elements (PEs).
Within each PE, it might be possible to perform an operation on more than one data item. Depending on the details of the GPU microprocessor architecture and the GPU vendor, these are referred to as SIMT, SIMD, or vector operations. A similar type of functionality can be provided by ganging PEs together.
With an understanding of the GPU hardware, we can now calculate the peak theoretical flops for some recent GPUs. These include the NVIDIA V100, AMD Vega 20, AMD Arcturus, and the integrated Gen11 GPU on the Intel Ice Lake CPU. Table 9.3 lists the specifications for these GPUs. We’ll use these specifications to calculate the theoretical performance of each device. Then, knowing the theoretical performance, you can make comparisons on how each performs. This can help you with purchasing decisions or with estimating how much faster or slower another GPU might be with your calculations. Hardware specifications for many GPU cards can be found at TechPowerUp: https://www.techpowerup.com/gpu-specs/.
For NVIDIA and AMD, the GPUs targeted to the HPC market have the hardware cores to perform one double-precision operation for every two single-precision operations. This relative flop capability can be expressed as a ratio of 1:2, where double precision is 1:2 of single precision on top-end GPUs. The importance of this ratio is that it tells you that you can roughly double your performance by reducing your precision requirements from double precision to single. For many GPUs, half precision has a ratio of 2:1 to single precision or double the flop capability. The Intel integrated GPU has 1:4 double precision relative to single precision, and some commodity GPUs have 1:8 ratios of double precision to single precision. GPUs with these lower ratios of double precision are targeted at the graphics market or for machine learning. To get these ratios, take the FP64 row and divide by the FP32 row.
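The ratio calculation described above (dividing the FP64 row by the FP32 row) can be sketched as follows; the GFlops numbers here are illustrative placeholders, not the actual values from table 9.3:

```python
# Deriving the double- to single-precision flop ratio from spec-sheet
# rows, as described in the text. The GFlops values are hypothetical
# placeholders standing in for the FP32 and FP64 rows of a spec table.

specs = {
    "HPC GPU":      {"fp32_gflops": 14_000, "fp64_gflops": 7_000},
    "consumer GPU": {"fp32_gflops": 11_000, "fp64_gflops": 1_375},
}

for name, s in specs.items():
    ratio = s["fp64_gflops"] / s["fp32_gflops"]
    # ratio of 0.5 prints as "1:2"; 0.125 prints as "1:8"
    print(f"{name}: FP64:FP32 = 1:{round(1 / ratio)}")
```

A 1:2 part doubles its throughput when you drop from double to single precision; on a 1:8 part the same change can pay off eightfold, which is why precision requirements matter when choosing hardware.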
Table 9.3 Specifications for recent discrete GPUs from NVIDIA, AMD, and an integrated Intel GPU
The peak theoretical flops can be calculated by taking the clock rate times the number of processors times the number of floating-point operations per cycle. The flops per cycle accounts for the fused-multiply add (FMA), which does two operations in one cycle.
Peak Theoretical Flops (GFlops/s)
= Clock rate (GHz) × Compute units × Processing elements per CU × Flops per cycle
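As a worked example of this formula, the sketch below computes the single- and double-precision peaks for an NVIDIA V100. The clock, CU count, and PE count are drawn from public spec sheets and should be treated as approximate rather than authoritative:

```python
# Peak theoretical flops from the formula above, using approximate
# NVIDIA V100 numbers: 80 CUs (SMs), 64 FP32 processing elements each,
# ~1.53 GHz boost clock, and 2 flops/cycle from fused multiply-add.

def peak_gflops(clock_ghz, compute_units, pes_per_cu, flops_per_cycle=2):
    return clock_ghz * compute_units * pes_per_cu * flops_per_cycle

fp32 = peak_gflops(1.53, 80, 64)   # roughly 15.7 TFlops single precision
fp64 = fp32 / 2                    # 1:2 FP64:FP32 ratio on HPC parts
print(f"V100 peak: {fp32:,.0f} GFlops/s FP32, {fp64:,.0f} GFlops/s FP64")
```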
Both the NVIDIA V100 and the AMD Vega 20 give impressive floating-point peak performance. The Ampere shows some additional improvement in floating-point performance, but it is the memory performance that promises greater increases. The MI100 from AMD shows a bigger jump in floating-point performance. The Intel integrated GPU is also quite impressive given that it is limited by the available silicon space and lower nominal design power of a CPU. With Intel developing plans for discrete graphics cards for several market segments, expect to see even more GPU options in the future.
A typical GPU has different types of memory. Using the right memory space can make a big impact on performance. Figure 9.4 shows these memories as a conceptual diagram. It helps to see the physical locations of each level of memory to understand how it should behave. Although a vendor can put the GPU memory wherever they want, it must behave as shown in this diagram.
Figure 9.4 Rectangles show each component of the GPU and the memory that is at each hardware level. The host writes and reads the global and constant memory. Each of the CUs can read and write from the global memory and read from the constant memory.
The list of the GPU memory types and their properties are as follows.
Private memory (register memory)—Immediately accessible by a single PE and only by that PE.
Local memory—Accessible to a single CU and all of the PEs on that CU. Local memory can be split between a scratchpad that can be used as a programmable cache and, by some vendors, a traditional cache on GPUs. Local memory is around 64-96 KB in size.
Constant memory—Read-only memory accessible and shared across all of the CUs.
Global memory—Memory that’s located on the GPU and accessible by all of the CUs.
One of the factors that makes GPUs fast is that they use specialized global memory (RAM), which provides higher bandwidth, whereas current CPUs use DDR4 memory and are just now moving to DDR5. GPUs use a special version called GDDR5 that gives higher performance. The latest GPUs are now moving to High-Bandwidth Memory (HBM2) that provides even higher bandwidth. Besides increasing bandwidth, HBM also reduces power consumption.
You can calculate the theoretical peak memory bandwidth for a GPU from the memory clock rate on the GPU and the width of the memory transactions in bits. Table 9.4 shows some of the higher values for each memory type. We also need to multiply by a factor of two for the double data rate, which retrieves memory at both the top of the cycle and at the bottom. Some DDR memory can even do more transactions per cycle. Table 9.4 also shows some of the transaction multipliers for different kinds of graphics memory.
Table 9.4 Specifications for common GPU memory types
Calculating the theoretical memory bandwidth takes the memory clock rate times the number of transactions per cycle and then multiplies by the number of bits retrieved on each transaction:
Theoretical Bandwidth = Memory Clock Rate (GHz) × Memory bus (bits) × (1 byte/8 bits) × transaction multiplier
Some specification sheets give the memory transaction rate in Gbps rather than the memory clock frequency. This rate is the transactions per cycle times the clock rate. Given this specification, the bandwidth equation becomes
Theoretical Bandwidth = Memory Transaction Rate(Gbps) × Memory bus (bits) × (1 byte/8 bits)
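The two formulas are equivalent, since the transaction rate in Gbps is just the clock rate times the transaction multiplier. A sketch using approximate V100 HBM2 numbers (4096-bit bus, roughly 877 MHz, double data rate) as an assumed example:

```python
# Both forms of the theoretical bandwidth formula, shown to agree.
# The HBM2 numbers (4096-bit bus, ~0.877 GHz, transaction multiplier
# of 2) approximate a V100 and come from public spec sheets.

def bw_from_clock(clock_ghz, bus_bits, transaction_multiplier):
    """Bandwidth in GB/s from memory clock rate."""
    return clock_ghz * bus_bits / 8.0 * transaction_multiplier

def bw_from_rate(rate_gbps, bus_bits):
    """Bandwidth in GB/s from the quoted transaction rate."""
    return rate_gbps * bus_bits / 8.0

hbm2 = bw_from_clock(0.877, 4096, 2)   # roughly 898 GB/s
same = bw_from_rate(0.877 * 2, 4096)   # same answer via the Gbps form
print(f"HBM2 theoretical bandwidth: {hbm2:.0f} GB/s")
```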
Because most of our applications scale with memory bandwidth, the STREAM Benchmark that measures memory bandwidth is one of the most important micro-benchmarks. We first used the STREAM Benchmark to measure the bandwidth on CPUs in section 3.2.4. The benchmark process is similar for the GPUs, but we need to rewrite the stream kernels in GPU languages. Fortunately, this has been done by Tom Deakin at the University of Bristol for a variety of GPU languages and hardware in his Babel STREAM code.
The Babel STREAM Benchmark code measures the bandwidth of a variety of hardware with different programming languages. We use it here to measure the bandwidth of an NVIDIA GPU using CUDA. Also available are versions in OpenCL, HIP, OpenACC, Kokkos, Raja, SYCL, and OpenMP with GPU targets. These are all different languages that can be used for GPU hardware like NVIDIA, AMD, and Intel GPUs.
We introduced the roofline performance model for CPUs in section 3.2.4. This model accounts for both the memory bandwidth and flop performance limits of the system. It is similarly useful for understanding the performance limits of GPUs.
Figure 9.5 shows the results of the roofline benchmarks for both the NVIDIA V100 and the AMD Vega20 GPUs.
Figure 9.5 Roofline plots for NVIDIA V100 and AMD Vega 20 showing the bandwidth and flop limits for the two GPUs.
There are many GPU options for cloud services and in the HPC server market. Is there a way to figure out the best value GPU for your application? We’ll look at a performance model that can help you select the best GPU for your workload.
By changing the independent variable in the roofline plot from arithmetic intensity to the memory bandwidth, the performance limits of an application relative to each GPU device are highlighted. The mixbench tool was developed to draw out the differences between the performance of different GPU devices. This information is really no different than that shown in the roofline model, but visually it has a different impact. Let’s go through an exercise using the mixbench tool to show what you can learn.
基准测试的结果绘制为以 GFlops/sec 为单位的计算速率相对于以 GB/sec 为单位的内存带宽(图 9.6)。基本上,基准测试的结果是在峰值 flop rate 处出现一条水平线,在内存带宽限制处出现一条垂直下降。取这些值中的最大值,并用于绘制右上角的单个点,该点捕获 GPU 设备的峰值 flop 和峰值带宽能力。
The results of the benchmark are plotted as the compute rate in GFlops/sec with respect to the memory bandwidth in GB/sec (figure 9.6). Basically, the benchmark results in a horizontal line at the peak flop rate and a vertical dropoff at the memory bandwidth limit. The maximum of each of these values is taken and used to plot the single point at the upper right that captures both the peak flop and peak bandwidth capabilities of the GPU device.
图 9.6 在 V100 上运行 Mixbench 的数据输出(显示为线图)。找到最大带宽和浮点速率,并用于在图的右上角绘制 V100 性能点。
Figure 9.6 The data output from a run of mixbench on a V100 (shown as a line plot). The maximum bandwidth and floating-point rate is found and used to plot the V100 performance point in the upper right of the plot.
我们可以为各种 GPU 设备运行 Mixbench 工具,并获得它们的峰值性能特征。专为 HPC 市场设计的 GPU 设备具有很高的双精度浮点功能,而用于图形和机器学习等其他市场的 GPU 则侧重于单精度硬件。
We can run the mixbench tool for a variety of GPU devices and get their peak performance characteristics. GPU devices designed for the HPC market have a high double-precision floating-point capability, and GPUs for other markets like graphics and machine learning focus on single-precision hardware.
我们可以在图 9.7 中绘制每个 GPU 设备以及一条表示应用程序的算术或操作强度的线。大多数典型应用在 1 flop/load 强度左右。在另一个极端,矩阵乘法的算术强度为 65 flops/load。我们在图 9.7 中显示了这两种类型的应用程序的斜线。如果 GPU 点位于应用程序线上方,则我们向下绘制一条垂直线到应用程序线,以查找可实现的应用程序性能。对于右侧且低于 application line 的设备,我们使用水平线来查找性能限制。
We can plot each GPU device in figure 9.7 along with a line representing the arithmetic or operational intensity of an application. Most typical applications are around a 1 flop/load intensity. At the other extreme, matrix multiplication has an arithmetic intensity of 65 flops/load. We show a sloped line for both of these types of applications in figure 9.7. If the GPU point is above the application line, we draw a vertical line down to the application line to find what the achievable application performance will be. For a device to the right and lower than the application line, we use a horizontal line to find the performance limit.
图 9.7 GPU 设备的性能点集合(如右图所示)以及应用程序算术强度(以直线所示)。线上方的值表示应用程序受内存限制,线下方的值表示应用程序受计算限制。
Figure 9.7 A collection of performance points for GPU devices (shown on the plot on the right) along with the application arithmetic intensity (shown as straight lines). Values above the line indicate that the application is memory-bound and below the line indicates it is compute-bound.
该图清楚地表明了 GPU 设备特征与应用程序要求的匹配。对于具有 1 flop/load 算术强度的典型应用程序,像 GeForce GTX 1080Ti 这样为图形市场打造的 GPU 是不错的选择。V100 GPU 更适合用于大型计算系统 TOP500 排名的 Linpack 基准测试,因为它基本上由矩阵乘法组成。像 V100 这样的 GPU 是专为 HPC 市场打造的专用硬件,价格溢价很高。对于一些算术强度较低的应用程序,为图形市场设计的商用 GPU 可能更有价值。
What the plot makes clear is the match in the GPU device characteristics relative to the application requirements. For the typical application that has a 1 flop/load arithmetic intensity, GPUs like the GeForce GTX 1080Ti, built for the graphics market, are a good match. The V100 GPU is more suited for the Linpack benchmark used for the TOP500 ranking of large computing systems because it’s basically composed of matrix multiplications. GPUs like the V100 are specialized hardware specifically built for the HPC market and command a high price premium. For some applications with lower arithmetic intensity, the commodity GPUs designed for the graphics market can be a better value.
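The geometric construction in figure 9.7 amounts to taking the minimum of two ceilings: the device's peak flop rate, and its peak bandwidth scaled by the application's arithmetic intensity. A minimal sketch in Python (the peak values below are illustrative placeholders, not vendor specifications):

```python
# Achievable performance = min(compute ceiling, bandwidth ceiling).
# Peak numbers below are illustrative placeholders, not vendor specs.

def achievable_gflops(peak_gflops, peak_gb_s, flops_per_load, bytes_per_load=4):
    """Estimate achievable GFlops/s for a device and an application intensity.

    flops_per_load is the arithmetic intensity; one load moves
    bytes_per_load bytes (4 for single-precision floats).
    """
    bandwidth_ceiling = (flops_per_load / bytes_per_load) * peak_gb_s
    return min(peak_gflops, bandwidth_ceiling)

hpc_gpu = (7000.0, 900.0)   # hypothetical HPC-class GPU: GFlops/s, GB/s
gfx_gpu = (350.0, 480.0)    # hypothetical graphics-class GPU (double precision)

for name, (flops, bw) in [("HPC GPU", hpc_gpu), ("graphics GPU", gfx_gpu)]:
    for intensity in (1, 65):   # typical app vs. matrix multiplication
        rate = achievable_gflops(flops, bw, intensity)
        bound = "memory-bound" if rate < flops else "compute-bound"
        print(f"{name} at {intensity} flops/load: {rate:.0f} GFlops/s ({bound})")
```

At 1 flop/load both hypothetical devices are memory-bound and the expensive flop hardware sits idle, which is the plot's argument for matching device characteristics to the application.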
需要图 9.1 所示的 PCI 总线将数据从 CPU 传输到 GPU 并返回。数据传输的成本可能会严重限制移动到 GPU 的操作的性能。限制来回传输的数据量以从 GPU 获得任何加速通常至关重要。
The PCI bus shown in figure 9.1 is needed to transfer data from the CPU to the GPU and back. The cost of the data transfer can be a significant limitation on the performance of operations moved to the GPU. It is often critical to limit the amount of data that gets transferred back and forth to get any speedup from the GPU.
PCI 总线的当前版本称为 PCI Express (PCIe)。截至撰写本文时,它已在 1.0 到 6.0 的几代中进行了多次修订。您应该知道系统中有哪一代 PCIe 总线,以了解其性能限制。在本节中,我们将展示两种估算 PCI 总线带宽的方法:
The current version of the PCI bus is called PCI Express (PCIe). It has been revised several times in generations from 1.0 to 6.0 as of this writing. You should know which generation of the PCIe bus you have in your system to understand its performance limitation. In this section, we show two methods for estimating the bandwidth of a PCI bus:
理论峰值性能模型可用于快速估计新系统上的预期结果。该模型的优点是您无需在系统上运行任何应用程序。当您刚开始一个项目并希望快速手动估计可能的性能瓶颈时,这非常有用。此外,基准测试示例显示,您可以达到的峰值带宽取决于您如何使用硬件。
The theoretical peak performance model is useful for quickly estimating what you might expect on a new system. The model has the benefit that you don’t need to run any applications on the system. This is useful when you are just starting a project and would like to quickly estimate by hand the possible performance bottlenecks. In addition, the benchmark example shows that the peak bandwidth you can reach depends on how you make use of the hardware.
在专用 GPU 平台上,GPU 和 CPU 之间的所有数据通信都通过 PCI 总线进行。因此,它是一个关键的硬件组件,可能会严重影响应用程序的整体性能。在本节中,我们将介绍您需要了解的 PCI 总线的主要特性和描述符,以便计算理论 PCI 总线带宽。了解如何动态计算此数字有助于估计将应用程序移植到 GPU 时可能的性能限制。
On dedicated GPU platforms, all data communication between the GPU and CPU occurs over the PCI bus. Because of this, it is a critical hardware component that can heavily influence the overall performance of your application. In this section, we go over the key features and descriptors of a PCI bus that you need to be aware of in order to calculate the theoretical PCI bus bandwidth. Knowing how to calculate this number on the fly is useful for estimating possible performance limitations when porting your application to GPUs.
PCI 总线是将专用 GPU 连接到 CPU 和其他设备的物理组件。它允许 CPU 和 GPU 之间的通信。通信通过多个 PCIe 通道进行。我们首先提出一个理论带宽的公式,然后解释每个项。
The PCI bus is a physical component that attaches dedicated GPUs to the CPU and other devices. It allows for communication between the CPU and GPU. Communication occurs over multiple PCIe lanes. We’ll start by presenting a formula for the theoretical bandwidth, and then explain each term.
理论带宽 (GB/s) = 通道数 × 传输速率 (GT/s) × 开销因子 (Gb/GT) ×字节/8 位
Theoretical Bandwidth (GB/s) = Lanes × TransferRate (GT/s) × OverheadFactor(Gb/GT) × byte/8 bits
理论带宽以每秒千兆字节 (GB/s) 为单位进行测量。它是通过将通道数和每个通道的最大传输速率相乘,然后从位转换为字节来计算的。转换保留在公式中,因为传输速率通常以每秒 GigaTransfers (GT/s) 为单位报告。开销因子是由于用于确保数据完整性的编码方案降低了有效传输速率。对于第 1.0 代设备,编码方案的成本为 20%,因此开销系数为 100%-20% 或 80%。从第 3.0 代开始,编码方案开销下降到仅 1.54%,因此实现的带宽与传输速率基本相同。现在,我们深入研究带宽等式中的每个术语。
The theoretical bandwidth is measured in units of gigabytes per second (GB/s). It is calculated by multiplying the number of lanes and the maximum transfer rate for each lane and then converting from bits to bytes. The conversion is left in the formula because the transfer rates are usually reported in GigaTransfers per second (GT/s). The overhead factor is due to an encoding scheme used to ensure data integrity, reducing the effective transfer rate. For generation 1.0 devices, the encoding scheme had a cost of 20%, so the overhead factor would be 100%-20% or 80%. From generation 3.0 onward, the encoding scheme overhead drops to just 1.54%, so the achieved bandwidth becomes essentially the same as the transfer rate. Let’s now dive into each of the terms in the bandwidth equation.
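The formula is easy to evaluate in a few lines of code. A sketch using the per-generation transfer rates and encoding overheads discussed in the text (treat the table values as commonly published reference numbers):

```python
# Theoretical PCIe bandwidth, following the formula in the text:
#   Bandwidth (GB/s) = lanes x transfer rate (GT/s) x overhead factor / 8
# Transfer rates and encoding efficiencies are the commonly published
# per-generation values (Gen1/2 use 8b/10b encoding, Gen3+ use 128b/130b).

PCIE_GEN = {
    1: (2.5, 8 / 10),      # 2.5 GT/s per lane, 20% encoding overhead
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),   # ~1.54% encoding overhead
    4: (16.0, 128 / 130),
}

def pcie_bandwidth_gb_s(generation, lanes=16):
    transfer_rate, efficiency = PCIE_GEN[generation]
    return lanes * transfer_rate * efficiency / 8.0  # bits -> bytes

for gen in sorted(PCIE_GEN):
    print(f"Gen{gen} x16: {pcie_bandwidth_gb_s(gen):.2f} GB/s")
```

A Gen3 x16 system evaluates to about 15.75 GB/s, which matches the 15.8 GB/s theoretical peak quoted later in this chapter.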
PCI 总线的通道数可以通过查看制造商规格来找到,或者您可以使用 Linux 平台上提供的许多工具。请记住,其中一些工具可能需要 root 权限。如果您没有这些权限,最好咨询您的系统管理员以了解此信息。尽管如此,我们将提供两个选项来确定 PCIe 通道的数量。
The number of lanes of a PCI bus can be found by looking through manufacturer specifications, or you can use a number of tools available on Linux platforms. Keep in mind that some of these tools might require root privileges. If you do not have these privileges, it is best to consult your system administrator to find out this information. Nonetheless, we will present two options for determining the number of PCIe lanes.
Linux 系统上可用的常用实用程序是 lspci。此实用程序列出了连接到主板的所有组件。我们可以使用 grep 正则表达式工具仅过滤掉 PCI 桥。以下命令显示供应商信息和设备名称以及 PCIe 通道数。在此示例中,输出中的 (x16) 表示有 16 个通道。
A common utility available on Linux systems is lspci. This utility lists all the components attached to the motherboard. We can use the grep regular expression tool to filter out only the PCI bridge. The following command shows you the vendor information and the device name with the number of PCIe lanes. For this example, (x16) in the output indicates that there are 16 lanes.
$ lspci -vmm | grep "PCI bridge" -A2
Class: PCI bridge
Vendor: Intel Corporation
Device: Sky Lake PCIe Controller (x16)
Alternatively, the dmidecode command provides similar information:
$ dmidecode | grep "PCI"
PCI is supported
Type: x16 PCI Express
Determining the maximum transfer rate
PCIe 总线中每个通道的最大传输速率可以直接由其设计代系决定。代系是对硬件所需性能的规范,就像 4G 是手机的行业标准一样。PCI 特别兴趣小组 (PCI-SIG) 代表行业合作伙伴,制定了通常简称为"代"(gen) 的 PCIe 规范。表 9.5 显示了每个 PCI 通道和每个方向的最大传输速率。
The maximum transfer rate for each lane in a PCIe bus can be determined directly from its design generation. A generation is a specification for the required performance of the hardware, much like 4G is an industry standard for cell phones. The PCI Special Interest Group (PCI-SIG) represents industry partners and establishes the PCIe specifications, commonly referred to as generations, or gen for short. Table 9.5 shows the maximum transfer rate per PCI lane and direction.
表 9.5 各代 PCI Express (PCIe) 规格
Table 9.5 PCI Express (PCIe) specifications by generation
如果您不知道 PCIe 总线的代系,可以使用 lspci 来获取此信息。在 lspci 输出的所有信息中,我们正在寻找 PCI 总线的链路容量。在此输出中,链路容量缩写为 LnkCap:
If you don’t know the generation of your PCIe bus, you can use lspci to get this information. In all of the information output by lspci, we are looking for the link capacity for the PCI bus. In this output, link capacity is abbreviated LnkCap:
$ sudo lspci -vvv | grep -E 'PCI|LnkCap'
Output:
00:01.0 PCI bridge:
Intel Corporation Sky Lake PCIe Controller (x16) (rev 07)
LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s
现在我们从此输出中知道了最大传输速率,可以将其用于带宽公式。知道这个速度也可以对应到具体的代系,这同样很有帮助。在本例中,输出表明我们使用的是 Gen3 PCIe 系统。
Now that we know the maximum transfer rate from this output, we can use it in the bandwidth formula. It's also helpful to know that this speed maps back to a generation. In this case, the output indicates that we are working with a Gen3 PCIe system.
注意在某些系统上,lspci 和其他系统实用程序的输出可能不会提供太多信息。输出是特定于系统的,仅报告每个设备的标识。如果您无法确定这些实用程序的特征,则回退可能是使用 Section 9.4.2 中给出的 PCI 基准测试代码来确定系统的功能。
Note On some systems, the output from lspci and other system utilities may not give much information. The output is system specific and just reports the identification from each device. If you are unable to determine the characteristics from these utilities, your fallback might be to use the PCI benchmark code given in section 9.4.2 to determine the capabilities of your system.
通过 PCI 总线传输数据需要额外的开销。第 1 代和第 2 代标准规定,每 8 位有效数据需要传输 10 位。从第 3 代开始,每 128 位数据需要传输 130 位。开销因子是可用位数与传输的总位数之比(表 9.5)。
Transmitting data across the PCI bus requires additional overhead. The generation 1 and 2 standards stipulate that 10 bits are transmitted for every 8 bits of useful data. Starting with generation 3, 130 bits are transmitted for every 128 bits of data. The overhead factor is the ratio of usable bits to total bits transmitted (table 9.5).
Reference Data for PCIe theoretical peak bandwidth
现在我们已经有了所有必要的信息,让我们通过一个示例来估计理论带宽,使用前面几节中所示的输出。
Now that we have all of the necessary information, let’s estimate the theoretical bandwidth through an example, using output shown in the previous sections.
理论 PCI 带宽的公式给出了预期的最佳峰值带宽。换句话说,这是应用程序可以为给定平台实现的最高带宽。在实践中,实现的带宽可能取决于许多因素,包括操作系统、系统驱动程序、计算节点上的其他硬件组件、GPU 编程 API 以及通过 PCI 总线发送的数据块的大小。在您可以访问的大多数系统上,除了最后两个选项之外,其他所有选项都可能超出您的修改控制范围。在开发应用程序时,您可以控制编程 API 和通过 PCI 总线传输的数据块的大小。
The equation for the theoretical PCI bandwidth gives the expected best peak bandwidth. In other words, this is the highest possible bandwidth that an application can achieve for a given platform. In practice, the achieved bandwidth can depend on a number of factors, including the OS, system drivers, other hardware components on the compute node, GPU programming API, and the size of the data block sent across the PCI bus. On most systems you have access to, it is likely that all but the last two choices are out of your control to modify. When developing your application, you are in control of the programming API and the size of the data blocks that are transmitted across the PCI bus.
考虑到这一点,我们剩下的问题是数据块大小如何影响实现的带宽?这种类型的问题通常用微基准测试来回答。微基准测试是一个小程序,旨在执行大型应用程序将使用的单个进程或硬件。微基准测试有助于提供系统性能的一些指示。
With this in mind, we are left with the question: how does the data block size influence the achieved bandwidth? This type of question is typically answered with a micro-benchmark. A micro-benchmark is a small program that is meant to exercise a single process or piece of hardware that a larger application will use. Micro-benchmarks help provide some indication of system performance.
在我们的情况下,我们想要设计一个微基准测试,将数据从 CPU 复制到 GPU,反之亦然。由于数据复制预计在微秒到数十微秒内发生,因此我们将测量完成数据复制 1000 次所需的时间。然后将此时间除以 1,000 以获得在 CPU 和 GPU 之间复制数据的平均时间。
In our situation, we want to devise a micro-benchmark that copies data from the CPU to the GPU and vice-versa. Because the data copy is expected to happen in microseconds to tens of microseconds, we will measure the time it takes to complete the data copy 1,000 times. This time will then be divided by 1,000 to obtain the average time to copy data between the CPU and GPU.
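The repeat-and-average timing pattern can be sketched on the host alone; plain in-memory copies stand in here for the CUDA transfers that the actual benchmark (listing 9.1) performs:

```python
# Host-only stand-in for the CUDA micro-benchmark's timing loop:
# repeat a memory copy many times and divide by the repeat count.
import time

def average_copy_time(n_bytes, repeats=1000):
    """Return the average time in seconds for one copy of an n_bytes buffer."""
    src = bytearray(n_bytes)
    start = time.perf_counter()
    for _ in range(repeats):
        dst = bytes(src)  # stands in for cudaMemcpy in the real benchmark
    elapsed = time.perf_counter() - start
    return elapsed / repeats

avg = average_copy_time(1 << 20)  # 1 MiB buffer
print(f"average copy time: {avg * 1e6:.2f} microseconds")
```

Averaging over many repetitions is what makes microsecond-scale transfers measurable with a coarse timer.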
我们将逐步演练使用基准测试应用程序来测量 PCI 带宽。清单 9.1 显示了将数据从主机复制到 GPU 的代码。第 10 章将介绍使用 CUDA GPU 编程语言编写,但从函数名称中可以清楚地了解基本操作。在列表 9.1 中,我们采用平面 1-D 数组的大小作为输入,我们想要从 CPU (主机) 复制到 GPU (设备)。
We will do a step-by-step walk through using a benchmark application to measure PCI bandwidth. Listing 9.1 shows the code that copies data from the host to the GPU. Writing in the CUDA GPU programming language will be covered in chapter 10, but the basic operation is clear from the function names. In listing 9.1, we take in the size of a flat 1-D array as input that we want to copy from CPU (host) to the GPU (device).
注意此代码位于 https://github.com/EssentialsofParallelComputing/Chapter9 的 PCI_Bandwidth_Benchmark 子目录中。
Note This code is available in the PCI_Bandwidth_Benchmark subdirectory at https://github.com/EssentialsofParallelComputing/Chapter9.
Listing 9.1 Copying data from CPU host to GPU device
PCI_Bandwidth_Benchmark.c
35 void Host_to_Device_Pinned( int N, double *copy_time )
36 {
37 float *x_host, *x_device;
38 struct timespec tstart;
39
40 cudaError_t status = cudaMallocHost((void**)&x_host, N*sizeof(float));❶
41 if (status != cudaSuccess)
42 printf("Error allocating pinned host memory\n");
43 cudaMalloc((void **)&x_device, N*sizeof(float)); ❷
44
45 cpu_timer_start(&tstart);
46 for(int i = 1; i <= 1000; i++ ){
47 cudaMemcpy(x_device, x_host, N*sizeof(float),
cudaMemcpyHostToDevice); ❸
48 }
49 cudaDeviceSynchronize(); ❹
50
51 *copy_time = cpu_timer_stop(tstart)/1000.0;
52
53 cudaFreeHost( x_host ); ❺
54 cudaFree( x_device ); ❺
55 }
❶ Allocates pinned host memory for an array on the CPU
❷ Allocates memory for an array on the GPU
❸ Copies the array from the CPU host to the GPU device
❹ Synchronizes the GPU so that the work completes
❺ Frees the host and device memory
在列表 9.1 中,第一步是为主机和设备副本分配内存。此清单中第 40 行的例程使用 cudaMallocHost 在主机上分配固定内存,以加快数据传输速度。对于使用常规可分页内存的例程,使用标准 malloc 和 free 调用。cudaMemcpy 例程将数据从 CPU 主机传输到 GPU。cudaDeviceSynchronize 调用会等待,直到复制完成。在循环之前,我们重复主机到设备的复制,我们捕获开始时间。然后,我们执行主机到设备复制 1000 次,并再次捕获当前时间。然后,通过除以 1,000 来计算从主机复制到设备的平均时间。为了保持整洁,我们释放了 host 和 device 数组所占用的空间。
In listing 9.1, the first step is to allocate memory for both the host and device copy. The routine on line 40 in this listing uses cudaMallocHost to allocate pinned memory on the host for faster data transfer. For the routine that uses regular pageable memory, the standard malloc and free calls are used. The cudaMemcpy routine transfers the data from the CPU host to the GPU. The cudaDeviceSynchronize call waits until the copy is complete. Before the loop, where we repeat the host to device copy, we capture the start time. We then execute the host-to-device copy 1,000 times and capture the current time again. The average time for copying from host to device is then calculated by dividing by 1,000. To keep things neat, we free the space held by the host and device arrays.
了解了将大小为 N 的数组从主机传输到设备所需的时间后,我们现在可以多次调用此例程,每次都更改 N。但是,我们更感兴趣的是估计实现的带宽。
With the knowledge of the time it takes to transfer an array of size N from host to device, we can now call this routine multiple times, changing N each time. However, we’re more interested in estimating the achieved bandwidth.
回想一下,带宽是每单位时间传输的字节数。目前,我们知道数组元素的数量以及在 CPU 和 GPU 之间复制数组所需的时间。传输的字节数取决于数组中存储的数据类型。例如,在包含浮点数(4 字节)的大小为 N 的数组中,在 CPU 和 GPU 之间复制的数据量为 4N 字节。如果在时间 T 内传输 4N 字节,则实现的带宽为 4N/T 字节/秒。
Recall that the bandwidth is the number of transmitted bytes per unit time. At the moment, we know the number of array elements and the time it takes to copy the array between the CPU and GPU. The number of bytes transmitted depends on the type of data stored in the array. For example, in an array of size N containing floats (4 bytes), the amount of data copied between CPU and GPU is 4N bytes. If 4N bytes are transferred in time T, the achieved bandwidth is 4N/T bytes per second.
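In code, mirroring the bandwidth calculation on line 92 of listing 9.2 (which reports the result in GiB/s):

```python
# Achieved bandwidth for an array of N 4-byte floats copied in T seconds.
# Mirrors line 92 of listing 9.2, which divides by 1024^3 to get GiB/s.

def achieved_bandwidth_gib_s(n_floats, copy_time_s):
    bytes_moved = 4.0 * n_floats           # floats are 4 bytes each
    return bytes_moved / (copy_time_s * 1024.0**3)

# Hypothetical measurement: 2^25 floats (128 MiB) copied in 11.3 ms
print(f"{achieved_bandwidth_gib_s(2**25, 0.0113):.2f} GiB/s")
```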
这允许我们构建一个数据集,将实现的带宽显示为 N 的函数。以下清单中的子例程要求指定最大数组大小,然后返回为每个实验测得的带宽。
This allows us to build a dataset showing the achieved bandwidth as a function of N. The subroutine in the following listing requires that the maximum array size is specified and then returns the bandwidth measured for each experiment.
列表 9.2 为不同的数组大小调用 CPU 到 GPU 的内存传输
Listing 9.2 Calling a CPU to GPU memory transfer for different array sizes
PCI_Bandwidth_Benchmark.c
81 void H2D_Pinned_Experiments(double **bandwidth, int n_experiments,
int max_array_size){
82 long long array_size;
83 double copy_time;
84
85 for(int j=0; j<n_experiments; j++){ ❶
86 array_size = 1;
87 for(int i=0; i<max_array_size; i++ ){ ❷
88
89 Host_to_Device_Pinned( array_size, &copy_time ); ❸
90
91 double byte_size=4.0*array_size; ❹
92 bandwidth[j][i] = byte_size/(copy_time*1024.0*1024.0*1024.0); ❹
93
94 array_size = array_size*2; ❷
95 }
96 }
97 }
❶ Repeats the experiments a few times
❷ Doubles the array size with each iteration
❸ Calls the CPU to GPU memory test and timing
❹ Computes the bytes copied and the achieved bandwidth in GiB/s
在这里,我们遍历数组大小,对于每个数组大小,我们获得主机到设备的平均复制时间。然后,带宽的计算方法是复制的字节数除以复制所需的时间。数组包含 floats,每个数组元素有四个字节。现在,让我们通过一个示例来说明如何使用微基准测试应用程序来描述 PCI 总线的性能。
Here, we loop over array sizes, and for each array size, we obtain the average host-to-device copy time. The bandwidth is then calculated by the number of bytes copied divided by the time it takes to copy. The array contains floats, which have four bytes for each array element. Now, let’s walk through an example that shows how you can use the micro-benchmark application to characterize the performance of your PCI Bus.
Performance of a Gen3 x16 on a laptop
我们在 GPU 加速的笔记本电脑上运行 PCI 带宽基准测试应用程序。在此系统上,lspci 命令显示它配备了 Gen3 x16 PCI 总线:
We ran the PCI bandwidth benchmark application on a GPU accelerated laptop. On this system, the lspci command shows that it is equipped with a Gen3 x16 PCI bus:
$ sudo lspci -vvv | grep -E 'PCI|LnkCap'
00:01.0 PCI bridge: Intel Corporation Sky Lake PCIe Controller (x16)
(rev 07)
LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1,
Exit Latency L0s$ sudo lspci -vvv | grep -E 'PCI|LnkCap'
00:01.0 PCI bridge: Intel Corporation Sky Lake PCIe Controller (x16)
(rev 07)
LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1,
Exit Latency L0s
对于该系统,PCI 总线的理论峰值带宽为 15.8 GB/s。图 9.8 显示了我们的微基准测试应用程序(曲线)中实现的带宽与理论峰值带宽(水平虚线)的比较图。所实现带宽周围的阴影区域表示带宽的 +/- 1 标准差。
For this system, the theoretical peak bandwidth of the PCI Bus is 15.8 GB/s. Figure 9.8 shows a plot of the achieved bandwidth in our micro-benchmark application (curved lines) compared to the theoretical peak bandwidth (horizontal dashed lines). The shaded region around the achieved bandwidth indicates the +/- 1 standard deviation in bandwidth.
图 9.8 显示了 Gen3 x16 PCIe 系统的理论峰值带宽(水平线)和微基准测试应用程序的经验测量带宽。该图还显示了固定内存和可分页内存的结果。
Figure 9.8 A theoretical peak bandwidth (horizontal lines) and an empirically measured bandwidth from the micro-benchmark application are shown for a Gen3 x16 PCIe system. The figure also shows results for both pinned and pageable memory.
首先,请注意,当通过 PCI 总线发送小块数据时,实现的带宽很低。超过 10^7 字节的数组大小后,实现的带宽接近最大值 11.6 GB/s。另请注意,固定内存的带宽远高于可分页内存,并且对于每个内存大小,可分页内存的性能结果差异要大得多。了解什么是固定内存和可分页内存有助于理解这种差异的原因。
First, notice that when small chunks of data are sent across the PCI bus, the achieved bandwidth is low. Beyond array sizes of 10^7 bytes, the achieved bandwidth approaches a maximum around 11.6 GB/s. Note also that the bandwidth with pinned memory is much higher than with pageable memory, and pageable memory has a much wider variation in performance results for each memory size. It is helpful to know what pinned and pageable memory are to understand the reason for this difference.
Pinned memory—Memory that cannot be paged from RAM and, thus, can be directly sent to the GPU without first making a copy
Pageable memory—Standard memory allocations that can be paged out to disk
分配固定内存会减少其他进程可用的内存,因为 OS 内核无法再将内存分页到磁盘,以便其他进程可以使用它。固定内存是从处理器的标准 DRAM 内存中分配的。分配过程花费的时间比常规分配要长一些。使用可分页内存时,必须先将其复制到固定内存位置,然后才能发送。固定内存可防止在传输内存时将其分页到磁盘。本节中的示例表明,CPU 和 GPU 之间的数据传输量更大,可实现的带宽更高。此外,在该系统上,实现的最大带宽仅达到理论峰值性能的 72% 左右。
Allocating pinned memory reduces the memory available to other processes because the OS kernel can no longer make the memory page out to disk so other processes can use it. Pinned memory is allocated from the standard DRAM memory for the processor. The allocation process takes a little bit longer than a regular allocation. When using pageable memory, it must be copied into a pinned memory location before it can be sent. The pinned memory prevents it from being paged out to disk while the memory is being transferred. The example in this section shows that larger data transfers between the CPU and GPU can result in higher achieved bandwidth. Further, on this system, the maximum achieved bandwidth only reaches about 72% of the theoretical peak performance.
现在我们已经介绍了 GPU 加速平台的基本组件,接下来将讨论您可能会遇到的更奇特的配置。这些配置来自多个 GPU 的引入。某些平台为每个节点提供多个 GPU,连接到一个或多个 CPU。其他平台则通过网络硬件提供与多个计算节点的连接。
Now that we’ve introduced the basic components of a GPU-accelerated platform, we’ll discuss more exotic configurations that you might encounter. These exotic configurations come from the introduction of multiple GPUs. Some platforms offer multiple GPUs per node, connected to one or more CPUs. Others offer connections to multiple compute nodes over network hardware.
在多 GPU 平台类型(图 9.9)上,通常需要使用 MPI+GPU 方法进行并行。对于数据并行性,每个 MPI 等级都分配给其中一个 GPU。让我们看看几种可能性:
On the types of multi-GPU platforms (figure 9.9), it is usually necessary to use an MPI+GPU approach to parallelism. For data parallelism, each MPI rank is assigned to one of the GPUs. Let’s look at a couple of possibilities:
图 9.9 这里我们展示了一个多 GPU 平台。单个计算节点可以具有多个 GPU 和多个处理器。一个网络中还可以连接多个节点。
Figure 9.9 Here we illustrate a multi-GPU platform. A single compute node can have multiple GPUs and multiple processors. There can also be multiple nodes connected across a network.
一些早期的 GPU 软件和硬件无法有效地处理多路复用,导致性能不佳。随着最新软件中修复了许多性能问题,在 GPU 上多路复用 MPI 等级变得越来越有吸引力。
Some of the early GPU software and hardware did not handle multiplexing efficiently, resulting in poor performance. With many of the performance problems fixed in the latest software, it is becoming increasingly attractive to multiplex MPI ranks onto the GPUs.
要使用多个 GPU,我们必须将数据从一个 GPU 发送到另一个 GPU。在讨论优化之前,我们需要描述标准数据传输过程。
To use multiple GPUs, we have to send data from one GPU to another. Before we can discuss the optimization, we need to describe the standard data transfer process.
如图 9.10 所示,这是大量的数据移动,将成为应用程序性能的主要限制。
As figure 9.10 shows, this is a lot of data movement and will be a major limitation to application performance.
在 NVIDIA GPUDirect® 中, CUDA 增加了在消息中发送数据的功能。AMD 在 OpenCL 中具有类似的功能,称为 DirectGMA,用于 GPU 到 GPU。指向数据的指针仍然需要传输,但消息本身通过 PCI 总线直接从一个 GPU 发送到另一个 GPU,从而减少内存移动。
With NVIDIA GPUDirect®, CUDA adds the capability to send the data in a message. AMD has a similar capability in OpenCL, called DirectGMA, for GPU-to-GPU transfers. The pointer to the data still has to be transferred, but the message itself gets sent over the PCI bus directly from one GPU to another, thereby reducing the memory movement.
图 9.10 顶部是将数据从 GPU 发送到其他 GPU 的标准数据移动。在底部,当数据从一个 GPU 移动到另一个 GPU 时,数据移动会绕过 CPU。
Figure 9.10 On the top is the standard data movement for sending data from a GPU to other GPUs. On the bottom, the data movement bypasses the CPU when moving data from one GPU to another.
毫无疑问,PCI 总线是具有多个 GPU 的计算节点的主要限制。虽然这主要是大型应用程序关心的问题,但它也会影响繁重的工作负载,例如在较小集群上进行机器学习。NVIDIA 推出了 NVLink®,用其 Volta 系列 P100 和 V100 GPU 取代 GPU 到 GPU 和 GPU 到 CPU 的连接。使用 NVLink 2.0,数据传输速率可以达到 300 GB/秒。AMD 的新 GPU 和 CPU 集成了 Infinity Fabric 以加快数据传输速度。多年来,Intel 一直在加速 CPU 和内存之间的数据传输。
There is little argument that the PCI bus is a major limitation for compute nodes with multi-GPUs. While this is mostly a concern for large applications, it also impacts heavy workloads such as in machine learning on smaller clusters. NVIDIA introduced NVLink® to replace GPU-to-GPU and GPU-to-CPU connections with their Volta line of P100 and V100 GPUs. With NVLink 2.0, the data transfer rates can reach 300 GB/sec. The new GPUs and CPUs from AMD incorporate Infinity Fabric to speed up data transfers. And Intel has been accelerating data transfers between CPUs and memory for some years.
何时值得移植到 GPU?此时,您已经了解了现代 GPU 的理论峰值性能,以及它与 CPU 的比较。将图 9.5 中 GPU 的屋顶图与第 3.2.4 节中的 CPU 屋顶线图进行比较,以了解浮点计算和内存带宽限制。在实践中,许多应用程序没有达到这些峰值性能值。然而,随着 GPU 的上限提高,相对于一些指标,其性能有可能优于 CPU 架构。这些因素包括执行应用程序的时间、能耗、云计算成本和可扩展性。
When is porting to GPUs worth it? At this point you’ve seen the theoretical peak performance for modern GPUs and how this compares to CPUs. Compare the roofline plot for the GPU in figure 9.5 to the CPU roofline plot in section 3.2.4 for both the floating-point calculation and the memory bandwidth limits. In practice, many applications do not reach these peak performance values. However, with the ceilings raised for GPUs, there is potential to outperform CPU architectures relative to a few metrics. These include the time to execute your application, energy consumption, cloud computing costs, and scalability.
假设您有一个在 CPU 上运行的现有代码。您花费了大量时间将 OpenMP 或 MPI 放入代码中,以便可以使用 CPU 上的所有内核。您觉得代码调整得很好,但一位朋友告诉您,将代码移植到 GPU 可能会让您受益更多。您的应用程序中有超过 10,000 行代码,并且您知道在 GPU 上运行代码需要付出相当大的努力。此时,您对在 GPU 上运行的前景感兴趣,因为您喜欢学习新事物,并且您相信朋友的见解。现在,你必须向你的同事和你的老板说明情况。
Suppose you have an existing code that runs on CPUs. You’ve spent a lot of time putting OpenMP or MPI into the code so that you can use all of the cores on the CPU. You feel like the code is well tuned, but a friend has told you that you might benefit more by porting your code to GPUs. You have more than 10,000 lines of code in your application, and you know that it will take considerable effort to get your code running on GPUs. At this point, you’re interested in the prospect of running on GPUs because you like learning new things and you trust your friend’s insights. Now, you have to make the case to your colleagues and your boss.
应用程序的重要措施是减少连续运行数天的作业的求解时间。了解缩短求解时间的影响的最佳方法是看一个例子。我们将使用 Cloverleaf 应用程序作为本研究的代理。
The important measure for your application is to reduce the time-to-solution for jobs that run days at a stretch. The best way to get across the impact of a reduction in time-to-solution is to look at an example. We’ll use the Cloverleaf application as a proxy for this study.
Now let’s get the run time for a couple of possible replacement platforms.
不错!时间不到以前的一半。您即将准备购买此系统,但随后认为也许您应该查看您一直听说的那些 GPU。
Not bad! Less than half the time as before. You are about ready to purchase this system but then think that maybe you should check out those GPUs that you have been hearing about.
这是 GPU 性能提升的典型情况吗?应用程序的性能提升范围很广,但这些结果并不罕见。
Is this typical of the performance gains with GPUs? Performance gains for applications span a wide range, but these results are not unusual.
对于并行应用来说,能源成本变得越来越重要。以前,计算机的能耗不是一个问题,而现在,运行计算机、存储磁盘和冷却系统的能源成本正迅速接近与计算系统生命周期内硬件购买成本相同的水平。
Energy costs are becoming increasingly important for parallel applications. Where once the energy consumption of computers was not a concern, now the energy costs of running the computers, storage disks, and cooling system are fast approaching the same levels as the hardware purchase costs over the lifetime of the computing system.
在百万兆次级计算的竞争中,最大的挑战之一是将百万兆次级系统的功率要求保持在 20 MW 左右。相比之下,这大约足够为 13,000 户家庭供电。数据中心的装机功率根本不足,无法远远超出此范围。另一方面,智能手机、平板电脑和笔记本电脑使用电池运行,可用能量有限(两次充电之间)。在这些设备上,专注于降低计算的能源成本以延长电池寿命可能是有益的。幸运的是,为减少能源使用而做出的积极努力使电力需求的增长速度保持合理。
In the race to Exascale computing, one of the biggest challenges is keeping the power requirements for the Exascale system to around 20 MW. For comparison, this is about enough power to supply 13,000 homes. There simply isn’t enough installed power in data centers to go much beyond that. At the other end of the spectrum, smart phones, tablets, and laptops run on batteries with a limited amount of available energy (between charges). On these devices, it can be beneficial to focus on reducing energy costs for a computation to stretch out the battery life. Fortunately, aggressive efforts to reduce energy usage have kept the rate of an increase in power demands reasonable.
如果不直接测量功耗,准确计算应用的能源成本是具有挑战性的。但是,您可以通过将制造商的热设计功耗 (TDP) 乘以应用程序的运行时间和使用的处理器数量来获得成本的上限。TDP 是在典型运行负载下消耗能量的速率。您的应用能耗可以使用以下公式估算
Accurately calculating the energy costs of an application is challenging without direct measurements of power usage. However, you can get an upper bound on the cost by multiplying the manufacturer's thermal design power (TDP) by the run time of the application and the number of processors used. TDP is the rate at which energy is expended under typical operational loads. The energy consumption for your application can be estimated using the formula
能量 = (N 个处理器) × (R 瓦特/处理器) × (T 小时)
Energy = (N Processors) × (R Watts/Processor) × (T hours)
其中 Energy 是能耗,N 是处理器数量,R 是 TDP,T 是应用程序的运行时间。让我们将基于 GPU 的系统与大致等效的 CPU 系统进行比较(表 9.6)。我们假设我们的应用程序受内存限制,因此我们将计算 10 TB/秒系统的成本和能耗。
where Energy is the energy consumption, N is the number of processors, R is the TDP, and T is the application’s run time. Let’s compare a GPU-based system to a roughly equivalent CPU system (table 9.6). We’ll assume our application is memory bound, so we’ll calculate the costs and energy consumption for a 10 TB/sec system.
Table 9.6 Designing a 10 TB/s bandwidth GPU and CPU system.
To calculate the energy costs for one day, we take the specifications from table 9.6 and calculate the nominal energy costs.
In general, GPUs have a higher TDP than CPUs (300 watts vs. 140 watts from table 9.6), so they consume energy at a higher rate. But GPUs can potentially reduce run time or require only a few devices to run your calculation. The same formula can be used as before, where N is now the number of GPUs.
Now we can see there is great potential for a GPU system, but the nominal values that we used might be quite far from reality. We could further refine this estimate by getting measured performance and energy draws for our algorithm.
Achieving a reduction in energy cost through GPU accelerator devices requires that the application expose sufficient parallelism and that the device’s resources are efficiently utilized. In the hypothetical example, we were able to cut the energy usage in half when running on 12 GPUs for the same amount of time it takes to execute on 45 fully subscribed CPU processors. The formula for energy consumption also suggests other strategies for reducing energy costs. We’ll discuss these strategies in a bit, but it’s important to note that, in general, a GPU consumes more energy than a CPU per unit time. We’ll start by comparing the energy consumption of a single CPU processor and a single GPU.
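The arithmetic behind the hypothetical example can be checked with the same formula. The counts and TDPs (12 GPUs at 300 W versus 45 CPU processors at 140 W) come from the text; the run time is arbitrary because it cancels in the ratio:

```python
def energy(n, tdp_watts, seconds):
    """Energy in joules for n processors at a given TDP and run time."""
    return n * tdp_watts * seconds

run_time = 60.0                            # any equal run time cancels in the ratio
gpu_energy = energy(12, 300.0, run_time)   # 12 GPUs at a 300 W TDP
cpu_energy = energy(45, 140.0, run_time)   # 45 CPU processors at a 140 W TDP
print(gpu_energy / cpu_energy)             # ~0.57: roughly half the energy
```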
Listing 9.3 shows how to plot the power and utilization for the V100 GPU.
Listing 9.3 Plotting the power and utilization data from nvidia-smi
power_plot.py

import matplotlib.pyplot as plt
import numpy as np
import re
from scipy.integrate import simps

fig, ax1 = plt.subplots()

gpu_power = []
gpu_time = []
sm_utilization = []

# Collect the data from the file, ignore empty lines
data = open('gpu_monitoring.log', 'r')

count = 0
energy = 0.0
nominal_energy = 0.0

for line in data:
    if re.match('^ 2019', line):
        line = line.rstrip("\n")
        fields = line.split()              # 16 whitespace-separated fields per line
        gpu_power_in = fields[3]           # GPU power draw in watts
        sm_utilization_in = fields[6]      # streaming multiprocessor utilization (%)
        if float(sm_utilization_in) > 80:
            gpu_power.append(float(gpu_power_in))
            sm_utilization.append(float(sm_utilization_in))
            gpu_time.append(count)
            count = count + 1
            energy = energy + float(gpu_power_in)*1.0     ❶
            nominal_energy = nominal_energy + 300.0*1.0   ❷

print(energy, "watts-secs", simps(gpu_power, gpu_time))   ❸
print(nominal_energy, "watts-secs", " ratio ", energy/nominal_energy*100.0)   ❹

ax1.plot(gpu_time, gpu_power, "o", linestyle='-', color='red')
ax1.fill_between(gpu_time, gpu_power, color='orange')
ax1.set_xlabel('Time (secs)', fontsize=16)
ax1.set_ylabel('Power Consumption (watts)', fontsize=16, color='red')
#ax1.set_title('GPU Power Consumption from nvidia-smi')

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

ax2.plot(gpu_time, sm_utilization, "o", linestyle='-', color='green')
ax2.set_ylabel('GPU Utilization (%)', fontsize=16, color='green')

fig.tight_layout()
plt.savefig("power.pdf")
plt.savefig("power.svg")
plt.savefig("power.png", dpi=600)
plt.show()
❶ Integrates power times time to get energy in watt-secs
❷ Gets the energy usage based on nominal power specification
❸ Prints the calculated energy and uses the simps integration function from scipy
❹ Calculates the actual vs. nominal energy usage
Figure 9.11 shows the resulting plot. At the same time, we integrate the area under the curve to get the energy usage. Note that even with the utilization at 100%, the power rate is only about 61% of the nominal GPU power specification. At idle, the GPU power consumption is around 20% of the nominal amount. This shows that the real power usage rate for GPUs is significantly lower than estimates based on nominal specifications. CPUs also draw less than their nominal rating, but probably not by as large a percentage of their nominal rate.
Figure 9.11 Power consumption for the CloverLeaf problem running on the V100. We integrate under the power curve to get 10.4 kJ for a run that lasted about 60 seconds. The rate of power consumption is about 61% of the nominal power specification for the V100 GPU.
When will multi-GPU platforms save you energy?
In general, parallel efficiency drops off as you add more CPUs or GPUs (remember Amdahl’s law from section 1.2?), and the cost for a computational job goes up. Sometimes, there are fixed costs (such as storage) associated with the overall run time of a job that are reduced if the job is finished sooner and the data can be transferred or deleted. The usual situation, however, is that you have a suite of jobs to run with a choice of how many processors for each job. The following example highlights the tradeoffs in this situation.
This example shows that if we are optimizing the run time for a large suite of jobs, it is often better to use less parallelism. In contrast, if we are more concerned with the turnaround time for a single job, more processors will be better.
Cloud computing services from Google and Amazon let you match your workloads to a wide range of compute server types and demands.
If your application is memory bound, you can use a GPU that has a lower flops-to-loads ratio at a lower cost.
If you are more concerned with turnaround time, you can add more GPUs or CPUs.
If your deadlines are less serious, you can use preemptible resources at a considerable reduction in cost.
As the cost of computing is more visible with cloud computing services, optimizing your application’s performance becomes a higher priority. Cloud computing has the advantage of giving you access to a wider variety of hardware than you can have on-site and more options to match the hardware to the workload.
GPUs are not general-purpose processors. They are most appropriate when the computation workload is similar to a graphics workload—lots of operations that are identical. There are some areas where GPUs still do not perform well, although with each iteration of the GPU hardware and software, some of these are addressed.
Lack of parallelism—To paraphrase Spiderman, “With great power comes great need for parallelism.” If you don’t have the parallelism, GPUs can’t do a lot for you. This is the first law of GPGPU programming.
Irregular memory access—CPUs also struggle with this. The massive parallelism of GPUs brings no benefit to this situation. This is the second law of GPGPU programming.
Thread divergence—Threads on GPUs all execute on each and every branch. This is a characteristic of SIMD and SIMT architectures (see section 1.4). Small amounts of short branching are fine, but wildly different branch paths do poorly.
Dynamic memory requirements—Memory allocation is done on the CPU, which severely limits algorithms that require memory sizes determined on the fly.
Recursive algorithms—GPUs have limited stack memory resources, and suppliers often state that recursion is not supported. However, a limited amount of recursion has been demonstrated to work in the mesh-to-mesh remapping algorithms in section 5.5.2.
GPU architectures continue to evolve with each iteration of hardware design. We suggest that you continue to track the latest developments and innovations. At the outset, GPU architectures were first and foremost for graphics performance. But the market has broadened into machine learning and computation as well.
For a much more detailed discussion of STREAM Benchmark performance and how it varies across parallel programming languages, we refer you to the following paper:
T. Deakin, J. Price, et al., “Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models,” GPU-STREAM, v2.0 (2016). Paper presented at Performance Portable Programming models for Manycore or Accelerators (P^3MA) Workshop at ISC High Performance, Frankfurt, Germany.
A good resource on the roofline model for GPUs can be found at Lawrence Berkeley Lab. A good starting point is
Charlene Yang and Samuel Williams, “Performance Analysis of GPU-Accelerated Applications using the Roofline Model,” GPU Technology Conference (2019) available at https://crd.lbl.gov/assets/Uploads/GTC19-Roofline.pdf.
In this chapter, we presented a simplified view of the mixbench performance model by assuming simple application performance requirements. The following paper presents a more thorough procedure to account for the complications of real applications:
Elias Konstantinidis and Yiannis Cotronis, “A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling.” Journal of Parallel and Distributed Computing 107 (2017): 37-56.
Table 9.7 shows the achievable performance for a 1 flop/load application. Look up the current prices for the GPUs available on the market and fill in the last two columns to get the flop per dollar for each GPU. Which looks like the best value? If turnaround time for your application run time is the most important criterion, which GPU would be best to purchase?
Table 9.7 Achievable performance for a 1 flop/load application with various GPUs
Measure the STREAM bandwidth of your GPU or another selected GPU. How does it compare to the ones presented in the chapter?
Use the likwid performance tool to get the CPU power requirements for the CloverLeaf application on a system where you have access to the power hardware counters.
The CPU-GPU system can provide a powerful boost for many parallel applications. It should be considered for any application with a lot of parallel work.
The GPU component of the system is in reality a general-purpose parallel accelerator. This means that it should be given the parallel part of the work.
Data transfer over the PCI bus and memory bandwidth are the most common performance bottlenecks on CPU-GPU systems. Managing the data transfer and memory use is important for good performance.
You’ll find a wide range of GPUs available for different workloads. Selecting the most suitable model will give the best price to performance ratio.
GPUs can reduce time-to-solution and energy costs. This can be a prime motivator in porting an application to GPUs.
In this chapter, we will develop an abstract model of how work is performed on GPUs. This programming model fits a variety of GPU devices from different vendors and across the models from each vendor. It is also a simpler model than what occurs on the real hardware, capturing just the essential aspects required to develop an application. Fortunately, various GPUs have a lot of similarities in structure. This is a natural result of the demands of high-performance graphics applications.
The choice of data structures and algorithms has a long-range impact on the performance and ease of programming for the GPU. With a good mental model of the GPU, you can plan how data structures and algorithms map to the parallelism of the GPU. Especially for GPUs, our primary job as application developers is to expose as much parallelism as we can. With thousands of threads to harness, we need to fundamentally change the work so that there are a lot of small tasks to distribute across the threads. In a GPU language, as in any other parallel programming language, there are several components that must exist. These are a way to
Express the computational loops in a parallel form for the GPU (see section 10.2)
Move data between the host CPU and the GPU compute device (see section 10.2.4)
Coordinate between threads that are needed for a reduction (see section 10.4)
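In serial Python form, the three components above can be sketched as follows. This is a minimal CPU-side illustration, not GPU code: on a GPU, the elementwise loop would become a kernel, the arrays would be transferred between host and device memory, and the sum would need a cooperative reduction across threads.

```python
import numpy as np

# 1) A data-parallel loop: every element is independent (a kernel candidate)
a = np.arange(1_000_000, dtype=np.float64)
b = 2.0 * a
c = a + b                 # forall-style elementwise operation

# 2) Data movement: on a GPU, a, b, and c would be copied to the device
#    before the kernel runs and copied back afterward (a no-op here)

# 3) A reduction: combining values requires coordination among threads
total = c.sum()
print(total)              # 1499998500000.0
```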
Look for how these three components are accomplished in each GPU programming language. In some languages, you directly control some aspects, and in others, you rely on the compiler or template programming to implement the needed operations. While the operation of a GPU might seem mysterious, these operations are not all that different from what is necessary on a CPU for parallel code. We have to write loops that are safe for fine-grained parallelism, sometimes called do concurrent for Fortran or forall or foreach in C/C++. We have to think about data movement between nodes, processes, and the processor. We also have to have special mechanisms for reductions.
For native GPU computation languages like CUDA and OpenCL, the programming model is exposed as part of the language. These GPU languages are covered in chapter 12. In that chapter, you’ll explicitly manage many aspects of parallelization for the GPU in your program. But with our programming model, you will be better prepared to make important programming decisions for better performance and scaling across a wide range of GPU hardware.
If you are using a higher-level programming language, such as the pragma-based GPU languages covered in chapter 11, do you really need to understand all the details of the GPU programming model? Even with pragmas, it is still helpful to understand how the work gets distributed. When you use a pragma, you are trying to steer the compiler and library to do the right thing. In some ways, this is harder than writing the program directly.
The goal of this chapter is to help you develop your application design for the GPU. This is mostly independent of the programming language for the GPU. There are questions you should answer up front. How will you organize your work, and what kind of performance can be expected? Or, more basically, should your application even be ported to the GPU, or would it be better off staying on the CPU? GPUs, with their promise of order-of-magnitude performance gains and lower energy use, are a compelling platform. But they are not a panacea for every application and use case. Let’s dive into the details of the GPU’s programming model and see what it can do for you.
Note We encourage you to follow along with the examples for this chapter at https://github.com/EssentialsofParallelComputing/Chapter10.
The GPU programming abstractions are possible for a reason. The basic characteristics, which we explore in more detail in a bit, include the following. Then we’ll take a quick look at some basic terminology for GPU parallelism.
Abstractions are based on what is necessary for high-performance graphics with GPUs. GPU workflows have some special characteristics that help to drive the commonality in the GPU-processing techniques. For a high frame rate and high-quality graphics, there are lots of pixels, triangles, and polygons to process and display.
Because of the large amounts of data, GPUs have massive parallelism. The operations on the data are generally identical, so GPUs use similar techniques to apply a single instruction to multiple data items to gain another level of efficiency. Figure 10.1 shows the common programming abstractions across various vendors and GPU models. These can be summarized as three or four basic techniques.
Figure 10.1 Our mental model for GPU parallelization contains the common programming abstractions across most GPU hardware.
We start with the computational domain and iteratively break up the work with the following components. We’ll discuss each of these subdivisions of work in sections 10.1.4 through 10.1.8:
One thing to note from these GPU parallel abstractions is that there are fundamentally three, or maybe four, different levels of parallelization that you can apply to a computational loop. In the original graphics use case, there is not much of a need to go beyond two or three dimensions and the corresponding number of parallelization levels. If your algorithm has more dimensions or levels, you must combine some computational loops to fully parallelize your problem.
Graphics workloads do not require much coordination within the operations. But as we will see in later sections, there are algorithms such as reductions that require coordination. We will have to develop complicated schemes to handle these situations.
The terminology for components of the GPU parallelism varies across vendors, adding a degree of confusion when reading programming documentation or articles. To help with cross-referencing the use of various terms, we summarize the official terms from each vendor in table 10.1.
Table 10.1 Programming abstractions and associated terminology for GPUs
OpenCL is the open standard for GPU programming, so we use it as the base terminology. OpenCL runs on all of the GPU hardware and many other devices such as CPUs and even more exotic hardware such as field-programmable gate arrays (FPGAs) and other embedded devices. CUDA, the NVIDIA proprietary language for their GPUs, is the most widely used language for GPU computation and, thus, used in a great fraction of the documentation on programming GPUs. HIP (Heterogeneous-Computing Interface for Portability) is a portable derivative of CUDA developed by AMD for their GPUs. It uses similar terminology as CUDA. The native AMD Heterogeneous Compute (HC) Compiler and the C++ AMP language from Microsoft use a lot of the same terms. (C++ AMP is in maintenance mode and not under active development as of this writing.) When trying to get portable performance, it’s also important to consider the corresponding features and terms for the CPU as shown in the last column in table 10.1.
The technique of data decomposition is at the heart of how GPUs obtain performance. GPUs break up the problem into many smaller blocks of data. Then they break it up again, and again.
GPUs must draw a lot of triangles and polygons to generate high frame rates. These operations are completely independent from each other. For this reason, the top-level data decomposition for computational work on a GPU also generates independent and asynchronous work.
With lots of work to do, GPUs hide latency (stalls for memory loads) by switching to another work group that is ready to compute. Figure 10.2 shows a case where only four subgroups (warps or wavefronts) can be scheduled due to resource limitations. When the subgroups hit a memory read and stall, execution switches to other subgroups. The execution switch, also called a context switch, is hiding latency with computation rather than with a deep cache hierarchy. If you only have a single instruction stream on a single piece of data, a GPU will be slow because it has no way to hide the latency. But if you have lots of data to operate on, it’s incredibly fast.
Figure 10.2 The GPU subgroup (warp) scheduler switches to other subgroups to cover memory reads and instruction stalls. Multiple work groups allow work to be done even when a work group is being synchronized.
Table 10.2 shows the device limitations for the current NVIDIA and AMD schedulers. For these devices, we want a high number of candidate work groups and subgroups to keep the processing elements busy.
Table 10.2 GPU subgroup (warp or wavefront) scheduler limitations
Data movement and, in particular, moving data up and down the cache hierarchy, is a substantial part of the energy cost for a processor. Therefore, the reduction in the need for a deep cache hierarchy has some significant benefits. There is a large reduction in the energy usage. Also, a lot of precious silicon space is freed on the processor. This space can then be filled with more arithmetic logic units (ALUs).
We show the data decomposition operation in figure 10.3, where a 2D computational domain is split into smaller 2D blocks of data. In OpenCL, this is called an NDRange, short for N-dimensional range (the CUDA term, a grid, is a little more palatable). The NDRange in this case is a 3×3 set of tiles of size 8×8. The data decomposition process breaks up the global computational domain, Gy by Gx, into smaller blocks or tiles of size Ty by Tx.
Figure 10.3 Breaking up the computational domain into small, independent work units
Let’s work through an example to see what this step accomplishes.
Table 10.3 shows examples of how this data decomposition might occur for 1D-, 2D-, and 3D-computational domains. The fastest changing tile dimension, Tx, should be a multiple of the cache line length, memory bus width, or subgroup (wavefront or warp) size for best performance. The number of tiles, NT, overall and in each dimension, results in a lot of work groups (tiles) to distribute across the GPU compute engines and processing elements.
Table 10.3 Data decomposition of the computational domain into tiles or blocks
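The decomposition arithmetic can be sketched in a few lines. The 1024×1024 domain and 8×8 tile sizes below are illustrative, and the sketch assumes partial tiles at the domain edge are rounded up to a full tile:

```python
import math

def tile_counts(global_dims, tile_dims):
    """Tiles (work groups) per dimension and in total for a decomposition."""
    per_dim = [math.ceil(g / t) for g, t in zip(global_dims, tile_dims)]
    return per_dim, math.prod(per_dim)

# A 1024 x 1024 2D domain split into 8 x 8 tiles
per_dim, ntiles = tile_counts((1024, 1024), (8, 8))
print(per_dim, ntiles)  # [128, 128] 16384
```

The large tile count, 16,384 here, is exactly what the GPU scheduler wants: far more independent work groups than compute units.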
For algorithms that need neighbor information, the optimum tile size for memory accesses needs to be balanced against getting the minimum surface area for the tile (figure 10.4). Neighbor data must be loaded more than once for adjacent tiles, which makes this an important consideration.
Figure 10.4 Each work group needs to load neighbor data from the dashed rectangle, resulting in duplicate loads in the shaded regions where more duplicate loads will be needed for the case on the left. This must be balanced against optimum contiguous data loads in the x-direction.
The work group spreads out the work across the threads on a compute unit. Each GPU model has a maximum size specified for the hardware. OpenCL reports this as CL_DEVICE_MAX_WORK_GROUP_SIZE in its device query. PGI reports it as Maximum Threads per Block in the output from its pgaccelinfo command (see figure 11.3). The maximum size for a work group is usually between 256 and 1,024. This is just the maximum. For computation, work group sizes are typically much smaller, so that there are more memory resources per work item or thread.
The work group is subdivided into subgroups or warps (figure 10.5). A subgroup is the set of threads that execute in lockstep. For NVIDIA, the warp size is 32 threads. For AMD it is called a wavefront, and the size is usually 64 work items. The work group size must be a multiple of the subgroup size.
Figure 10.5 A multi-dimensional work group is linearized onto a 1D strip where it is broken up into subgroups of 32 or 64 work items. For performance reasons, work groups should be multiples of the subgroup size.
The typical characteristics of work groups on GPUs are that they
Local memory provides fast access and can be used as a sort of programmable cache or scratchpad memory. If the same data is needed by more than one thread in a work group, performance can generally be improved by loading it into the local memory at the start of the kernel.
To further optimize the graphics operations, GPUs recognize that the same operations can be performed on many data elements. GPUs are therefore optimized by working on sets of data with a single instruction rather than with separate instructions for each. This reduces the number of instructions that need to be handled. This technique on the CPU is called single instruction, multiple data (SIMD). All GPUs emulate this with a group of threads where it is called single instruction, multi-thread (SIMT). See section 1.4 for the original discussion of SIMD and SIMT.
Because SIMT simulates SIMD operations, it is not necessarily constrained the same way as are SIMD operations by the underlying vector hardware. Current SIMT operations are executed in lockstep, with every thread in the subgroup executing all paths through branching if any one thread must go through a branch (figure 10.6). This is similar to how a SIMD operation is done with a mask. But because the SIMT operation is emulated, this could be relaxed with more flexibility in the instruction pipeline, where more than one instruction could be supported.
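To make the masking idea concrete, here is an illustrative NumPy emulation (not real GPU semantics): every lane "executes" both branch paths in lockstep, and a mask selects which result each lane keeps.

```python
import numpy as np

lanes = np.array([1.0, -2.0, 3.0, -4.0])
mask = lanes > 0.0            # lanes that take the 'then' branch
then_path = lanes * 2.0       # all lanes compute the taken path...
else_path = -lanes            # ...and the not-taken path
result = np.where(mask, then_path, else_path)  # mask picks per-lane results
print(result)                 # [2. 2. 6. 4.]
```

The cost of both paths is paid by every lane, which is why long, divergent branches hurt GPU performance while short ones are largely harmless.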
图 10.6 阴影矩形按线程和通道显示已执行的语句。SIMD 和 SIMT 操作以锁步方式执行所有语句,并为 false 语句设置掩码。大块条件语句可能会导致 GPU 出现分支分歧问题。
Figure 10.6 The shaded rectangles show the executed statements by threads and lanes. SIMD and SIMT operations execute all the statements in lockstep with masks for those that are false. Large blocks of conditionals can cause branch divergence problems for GPUs.
Small sections of conditionals for GPUs do not have a significant impact on overall performance. But if some threads take thousands of cycles longer than others, there’s a serious issue. If threads are grouped such that all the long branches are in the same subgroup (wavefront), there will be little or no thread divergence.
The basic unit of operation is called a work item in OpenCL. This work item can be mapped to a thread or to a processing core, depending on the hardware implementation. In CUDA, it is simply called a thread because that is how it is mapped in NVIDIA GPUs. Calling it a thread is mixing the programming model with how it is implemented in the hardware, but it is a little clearer to the programmer.
A work item can invoke another level of parallelism on GPUs with vector hardware units as figure 10.7 shows. This model of operation also maps to the CPU where a thread can execute a vector operation.
Figure 10.7 Each work item on an AMD or Intel GPU may be able to do a SIMD or Vector operation. This maps well over to the vector unit on a CPU as well.
Some GPUs also have vector hardware units and can do SIMD (vector) operations in addition to SIMT operations. In the graphics world, the vector units process spatial or color models. The use in scientific computation is more complicated and not necessarily portable between GPU hardware. The vector operation is done per work item, increasing the resource utilization for the kernel. But often there are additional vector registers to compensate for the additional work. Effective utilization of the vector units can provide a significant boost to performance when done well.
Vector operations are exposed in the OpenCL and AMD languages. Because CUDA hardware does not have vector units, the same level of support is not present in the CUDA languages. Still, OpenCL code with vector operations will run on CUDA hardware, where the vector operations are emulated.
Now we can begin to look at the code structure for the GPU that incorporates the programming model. For convenience and generality, we call the CPU the host and we use the term device to refer to the GPU.
The GPU programming model splits the loop body from the array range or index set that is applied to the function. The loop body creates the GPU kernel. The index set and arguments will be used on the host to make the kernel call. Figure 10.8 shows the transformation from a standard loop to the body of the GPU kernel. This example uses OpenCL syntax. But the CUDA kernel is similar, replacing the get_global_id call with
gid = blockIdx.x * blockDim.x + threadIdx.x
Figure 10.8 Correspondence between standard loop and the GPU kernel code structure
In the next four sections, we look separately at how the loop body becomes the parallel kernel and how to tie it back to the index set on the host. Let’s break this down into four steps:
GPU programming is the perfect language for the “Me” generation. In the kernel, everything is relative to yourself. Take for example
c[i] = a[i] + scalar*b[i];
In this expression, there is no information about the extent of the loop. This could be a loop where i, the global i index, covers a range from 0 to 1,000 or just the single value 22. Each data item knows what needs to be done to itself and itself only. This is truly a “Me” programming model, where I care only about myself. What is so powerful about this is that the operations on each data element become completely independent. Let’s look at the more complicated example of the stencil operator. Although we have two indices, both i and j, and some of the references are to adjacent data values, this line of code is still fully defined once we determine the values of i and j.
xnew[j][i] = (x[j][i] + x[j][i-1] + x[j][i+1] + x[j-1][i] + x[j+1][i])/5.0;
The separation of the loop body and the index set can be done in C++ with either functors or lambda expressions. In C++, lambda expressions have been around since the C++ 11 standard. Lambdas are used as a way for compilers to provide portability for single-source code to either CPUs or GPUs. Listing 10.1 shows the C++ lambda.
Definition Lambda expressions are unnamed, local functions that can be assigned to a variable and used locally or passed to a routine.
Listing 10.1 C++ lambda for the stream triad
lambda.cc
1 int main() {
2 const int N = 100;
3 double a[N], b[N], c[N];
4 double scalar = 0.5;
5
6 // c, a, and b are all valid scope pointers on the device or host
7
8 // We assign the loop body to the example_lambda variable
9 auto example_lambda = [&] (int i) { ❶
10 c[i] = a[i] + scalar * b[i]; ❷
11 };
12
13 for (int i = 0; i < N; i++) ❸
14 {
15 example_lambda(i); ❹
16 }
17 }
❸ Arguments or index set for lambda
The lambda expression is composed of four main components:
Function body—The code executed for each index, c[i] = a[i] + scalar * b[i];
Arguments—The argument (int i) used in the later call to the lambda expression.
Capture closure—The list of variables in the function body that are defined externally and how these are passed to the routine, specified by [&] in listing 10.1. The & indicates that the variable is referred to by reference and an = sign says to copy it by value. A single & sets the default to variables by reference. We can more fully specify the variables with the capture specification of [&c, &a, &b, &scalar].
Invocation—The for loop in lines 13 to 16 in listing 10.1 invokes the lambda over the specified array values.
Lambda expressions form the basis for more naturally generating code for GPUs in emerging C++ languages like SYCL, Kokkos, and Raja. We will briefly cover SYCL in chapter 12 as a higher-level C++ language (originally built on top of OpenCL). Kokkos from Sandia National Laboratories (SNL) and Raja, originating at Lawrence Livermore National Laboratory (LLNL), are two higher-level languages developed to simplify the writing of portable scientific applications for the broad array of today’s computing hardware. We’ll introduce Kokkos and Raja in chapter 12 as well.
The key to how the kernel can compose its local operation is that, as a product of the data decomposition, we provide each work group with some information about where it is in the local and global domains. In OpenCL, you can get the following information:
Dimension—Gets the number of dimensions, either 1D, 2D, or 3D, for this kernel from the kernel invocation
Global information—Global index in each dimension, which corresponds to a local work unit, or the global size in each dimension, which is the size of the global computational domain in each dimension
Local (tile) information—The local size in each dimension, which corresponds to the tile size in this dimension, or the local index in each dimension, which corresponds to the tile index in this dimension
Group information—The number of groups in each dimension or the group index in each dimension
Similar information is available in CUDA, but the global index must be calculated from the local thread index plus the block (tile) information:
gid = blockIdx.x * blockDim.x + threadIdx.x;
Figure 10.9 presents the indexing for the work group (block or tile) for OpenCL and CUDA. The function call for OpenCL is first, followed by the variable defined by CUDA. All of this indexing support is automatically done for you by the data decomposition for the GPU, greatly simplifying the handling of the mapping from the global space to the tile.
Figure 10.9 Mapping of the index of individual work item to global index space. The OpenCL call is given first, followed by the variable defined in CUDA.
The size of the indices for each work group should be identical. This is done by padding the global computational domain out to a multiple of the local work group size. We can do this with some integer arithmetic to get one extra work group and a padded global work size. The following example shows an approach using basic integer operations and then a second with the C ceil intrinsic function.
global_work_sizex = ((global_sizex + local_work_sizex - 1)/
                     local_work_sizex) * local_work_sizex
Note Avoiding out-of-bound reads and writes is important in GPU kernels because they lead to random kernel crashes with no error message or output.
Memory is still the most important concern impacting your application programming plan. Fortunately, there is a lot of memory on today’s GPUs. Both the NVIDIA V100 and AMD Radeon Instinct MI50 GPUs support 32 GB of RAM, so a GPU compute node with 4-6 GPUs has as much memory as a well-provisioned HPC CPU node with 128 GB. Therefore, we can use the same memory allocation strategy as we have for the CPU and not have to transfer data back and forth due to limited GPU memory.
Memory allocation for the GPU has to be done on the CPU. Often, memory is allocated for both the CPU and the GPU at the same time and then data is transferred between them. But if possible, you should allocate memory only for the GPU. This avoids expensive memory transfers back and forth from the CPU and frees up memory on the CPU. Algorithms that use a dynamic memory allocation present a problem for the GPU and need to be converted to a static memory algorithm, with the memory size known ahead of time. The latest GPUs do a good job of coalescing irregular or shuffled memory accesses into single, coherent cache-line loads when possible.
Definition Coalesced memory loads are the combination of separate memory loads from groups of threads into a single cache-line load.
On the GPU, the memory coalescing is done at the hardware level in the memory controller. The performance gains from these coalesced loads are substantial. But also important is that a lot of the optimizations from earlier GPU programming guides are no longer necessary, significantly reducing the GPU programming effort.
You can get some additional speedup from using local (shared) memory for data that is used more than once. This optimization used to be important for performance, but the better cache on GPUs is making the speedup less significant. There are a couple of strategies on how to use the local memory, depending on whether you can predict the size of the local memory required. Figure 10.10 shows the regular grid approach on the left and the irregular grid for unstructured and adaptive mesh refinement on the right. The regular grid has four abutting tiles with overlapping halo regions. The adaptive mesh refinement shows only four cells; a typical GPU application would load 128 or 256 cells and then bring in the required neighbor cells around the periphery.
Figure 10.10 For stencils on regular grids, load all the data into local memory and then use local memory for the computation. The inner solid rectangle is the computational tile. The outer dashed rectangle encloses the neighboring data needed for the calculation. You can use cooperative loads to load the data in the outer rectangle into the local memory for each work group. Because irregular grids have an unpredictable size, load only the computed region into local memory and use registers for each thread for the rest.
The processes for the two cases are
Threads need the same memory loads as adjacent threads. A good example of this is the stencil operation we use throughout the book. Thread i needs the i-1 and i+1 values, which means that multiple threads will need the same values. The best approach for this situation is to do cooperative memory loads. Copying the memory values from global memory to local (shared) memory results in a significant speedup.
An irregular mesh has an unpredictable number of neighbors, making it difficult to load into local memory. One way to handle this is to copy the part of the mesh to be computed into local memory. Then load the neighbor data into registers for each thread.
These are not the only ways to utilize the memory resources on the GPU. It is important to think through the issues with regard to the limited resources and the potential performance benefits for your particular application.
The key to good GPU programming is to manage the limited resources available for executing kernels. Let’s look at a few of the more important resource limitations in table 10.4. Exceeding the available resources can lead to significant decreases in performance. The NVIDIA compute capability 7.0 is for the V100 chip. The newer Ampere A100 chip uses a compute capability of 8.0 with nearly identical resource limits.
Table 10.4 Some resource limitations on current GPUs
The most important control available to the GPU programmer is the work group size. At first, it would seem desirable to use the maximum number of threads per work group. But computational kernels are more complex than graphics kernels, which places many demands on compute resources. This is known colloquially as memory pressure or register pressure. Reducing the work group size gives each work group more resources to work with. It also provides more work groups for context switching, which we discussed in section 10.1.1. The key to getting good GPU performance is finding the right balance of work group size and resources.
Definition Memory pressure is the effect of the computational kernel resource needs on the performance of GPU kernels. Register pressure is a similar term, referring to demands on registers in the kernel.
A full analysis of the resource requirements of a particular kernel and the resources available on the GPU requires an involved analysis. We’ll give examples of a couple of these types of deep dives. In the next two sections, we look at
You can find out how many registers your code uses by adding the -Xptxas="-v" flag to the nvcc compile command. In OpenCL for NVIDIA GPUs, use the -cl-nv-verbose flag for the OpenCL compile line to get a similar output.
We have discussed the importance of latency and context switching for good performance on the GPU. The benefit in “right-sized” work groups is that more work groups can be in flight at one time. For the GPU, this is important because when progress on a work group stalls due to memory latency, it needs to have other work groups that it can execute to hide the latency. To set the proper work group size, we need a measure of some sort. On GPUs, the measure used for analyzing work groups is called occupancy. Occupancy is a measure of how busy the compute units are during the calculation. The measure is complicated because it is dependent on a lot of factors, such as the memory required and the registers used. The precise definition is
Occupancy = Number of Active Threads/Maximum Number of Threads Per Compute Unit
Because the number of threads per subgroup is fixed, an equivalent definition is based on subgroups, also known as wavefronts or warps:
Occupancy = Number of Active Subgroups/Maximum Number of Subgroups Per Compute Unit
The number of active subgroups or threads is determined by the work group or thread resource that is exhausted first. Often this is the number of registers or local memory that is needed by a work group, preventing another work group from starting. We need a tool such as the CUDA Occupancy Calculator (presented in the following example) to do this well. NVIDIA programming guides focus a lot of attention on maximizing occupancy. While important, there just need to be enough work groups to switch between to hide latency and stalls.
Up to now, the computational loops we have looked at over cells, particles, points, and other computational elements could be handled by the approach in figure 10.8, where the for loops are stripped from the computational body to create a GPU kernel. Making this transformation is quick and easy and can be applied to the vast majority of loops in a scientific application. But there are other situations where the code conversion to the GPU is exceedingly difficult. We’ll look at algorithms that require a more sophisticated approach. Take for example, the single line of Fortran code using array syntax:
xmax = sum(x(:))
It looks so simple in Fortran, but it’s far more complicated on the GPU. The source of the difficulty is that we cannot do cooperative work or comparisons across work groups. The only way to accomplish this is to exit the kernel. Figure 10.11 illustrates the general strategy that deals with this situation.
Figure 10.11 The reduction pattern on the GPU requires two kernels to synchronize multiple work groups. We exit the first kernel, represented by the rectangle, and then start another one the size of a single work group to allow thread cooperation for the final pass.
For ease of illustration, figure 10.11 shows an array 32 elements long. The typical array for this method would be hundreds of thousands or even millions of elements long, much larger than the size of a work group. In the first step, we find the sum within each work group and store it in a scratch array whose length is the number of work groups or blocks. This first pass reduces the size of the array by a factor of the work group size, which could be 512 or 1,024. At this point, we cannot communicate between work groups, so we exit the kernel and start a new kernel with just one work group. The remaining data might still be larger than the work group size of 512 or 1,024, so we loop through the scratch array, summing the values into each work item. We can communicate between the work items within a work group, so we can then reduce to a single global value, summing along the way.
Complicated! The code to perform this operation on the GPU takes dozens of lines of code and two kernels to do the same operation that we can do in one line for the CPU. We’ll see more of the actual code for a reduction in chapter 12 when we cover CUDA and OpenCL programming. The performance that is obtained on the GPU is faster than the CPU, but it takes a lot of programming work. And we’ll start to see that one of the characteristics of GPUs is that synchronization and comparisons are hard to do.
We are going to see how we can more fully utilize a GPU by overlapping data transfer and computation. Two data transfers can occur at the same time as a computation on a GPU.
The basic nature of work on GPUs is asynchronous. Work is queued up on the GPU and, usually, only gets executed when a result or synchronization is requested. Figure 10.12 shows a typical set of commands sent to a GPU for a computation.
Figure 10.12 Work scheduled on a GPU in the default queue only gets completed when the wait for completion is requested. We scheduled the copy of a graphical image (a picture) to be copied to the GPU. Then we scheduled a mathematical operation on the data to modify it. We also scheduled a third operation to bring it back. None of these operations has to start until we demand the wait for completion.
We can also schedule work in multiple queues that are independent and asynchronous. The use of multiple queues as illustrated in figure 10.13 exposes the potential for overlapping data transfer and computation. Most of the GPU languages support some form of asynchronous work queues. In OpenCL the commands are queued, and in CUDA, the operations are placed in streams. While the potential for parallelism is created, whether it actually happens is dependent on the hardware capabilities and coding details.
Figure 10.13 Staging work for three images in parallel queues
If we have a GPU capable of simultaneously performing these operations,
then the work that is set up in three separate queues in figure 10.13 can overlap computation and communication as figure 10.14 shows.
Figure 10.14 Overlapping computation and data transfers reduce the time for three images from 75 ms to 45 ms. This is possible because the GPU can do a computation, a data transfer from the host to the device, and another one from the device to the host simultaneously.
Now we’ll move on to using our understanding of the GPU programming model to develop a strategy for parallelization of our application. We’ll use a couple of application examples to demonstrate the process.
Your application is an atmospheric simulation ranging from 1024x1024x1024 to 8192x8192x8192 in size with x as the vertical dimension, y as the horizontal, and z as the depth. Let’s look at the options you might consider:
For GPUs, we need tens of thousands of work groups for effective parallelism. From the GPU specification (table 9.3), we have 60-80 compute units, each with 32 double-precision arithmetic units, for about 2,000 simultaneous arithmetic pathways. In addition, we need more work groups so that latency can be hidden via context switching. Distributing data across the z-dimension gets us 1,024 to 8,192 work groups, which is low for GPU parallelism.
Let’s look at the resources needed for each work group. The minimum dimensions would be a 1024x1024 plane, plus any required neighbor data in ghost cells. We’ll assume one ghost cell in both directions. We would therefore need 1024 × 1024 × 3 × 8 bytes or 24 MiB of local data. Looking at table 10.4, GPUs have 64-96 KiB of local data, so we would not be able to preload data into local memory for faster processing.
Distributing across two dimensions would give us over a million potential work groups, so we would have enough independent work groups for the GPU. For each work group, we would have 1,024 to 8,192 cells. We have our own cell plus 4 neighbors for 1024 × 5 × 8 = 40 KiB minimum of required local memory. For larger problems and with more than one variable per cell, we would not have enough local memory.
Using the template from table 10.3, for each work group, let’s try using a 4x4x8 cell tile. With neighbors, this is 6 × 6 × 10 × 8 bytes for 2.8 KiB minimum of required local memory. We could have more variables per cell and can experiment with making the tile size a little larger.
Total memory for the 1024x1024x1024 cell mesh at 8 bytes per cell is 8 GiB. This is a large problem. GPUs have as much as 32 GiB of RAM, so the problem would possibly fit on one GPU. Larger problem sizes could require up to 512 GPUs, so we should also plan for distributed memory parallelism using MPI.
Let’s compare this to the CPU where these design decisions would have different outcomes. We might have work to spread across 44 processes, each with fewer resource restrictions. While the 3D approach could work, the 1D and 2D will also be feasible. Now let’s contrast that to an unstructured mesh where the data is all contained in 1D arrays.
In this case, your application is a 3D unstructured mesh using tetrahedral or polygonal cells that range from 1 to 10 million cells. But the data is a 1D list of polygons with data such as x, y, and z that contains the spatial location. In this case, there’s only one option: 1D data distribution.
Because the data is unstructured and contained in 1D arrays, the choices are simpler. We distribute the data in 1D with a tile size of 128. This gives us from 8,000 to 80,000 work groups, providing plenty of work for the GPU to switch between to hide latency. The memory requirement is 128 cells × 8-byte double-precision values = 1 KiB, leaving space for multiple data values per cell.
We will also need space for some integer mapping and neighbor arrays to provide the connectivity between the cells. Neighbor data is loaded into registers for each thread so that we don’t have to worry about the impact on local memory and possibly blowing past the memory limit. The largest size mesh at 10 million cells requires 80 MB, plus space for face, neighbor, and mapping arrays. These connectivity arrays can increase the memory usage significantly, but there should be plenty of memory on a single GPU to run computations on even the largest size meshes.
For best results, we will need to provide some locality for the unstructured data, either by using a data-partitioning library or by using a space-filling curve that keeps cells that are spatially close to each other close together in the array.
While the basic contours of the GPU programming model have stabilized, there are still a lot of changes occurring. In particular, the resources available for the kernels have slowly increased as the target uses broaden from 2D to 3D graphics and physics simulations for more realistic games. Markets such as scientific computing and machine learning are also becoming more important. For both these markets, custom GPU hardware has been developed: double precision for scientific computing and tensor cores for machine learning.
In our presentation, we’ve mostly discussed discrete GPUs. But there are also integrated GPUs, as first discussed in section 9.1.1. The Accelerated Processing Unit (APU) is an AMD product offering. Both AMD’s APUs and Intel’s integrated GPUs offer some advantages in reducing memory transfer costs because these are no longer on the PCI bus. This is offset by the reduction in the silicon area for GPU transistors and a lower power envelope. Still, this capability has been underappreciated since it appeared. The primary development focus has been on the big discrete GPUs that are in the top-end HPC systems. But the same GPU programming languages and tools work equally well with integrated GPUs. The critical limitation on developing new, accelerated applications is the lack of widespread knowledge of how to program and exploit these devices.
Other mass-market devices such as Android tablets and cell phones have programmable GPUs with the OpenCL language. Some resources for these include
Download OpenCL-Z and OpenCL-X benchmark applications from Google Play to see if your device supports OpenCL. Drivers may also be available from hardware vendors.
Compubench (https://compubench.com) has performance results for some mobile devices that use OpenCL or CUDA.
Intel has a nice site on programming with OpenCL for Android at https://software.intel.com/en-us/android/articles/opencl-basic-sample-for-android-os.
In recent years, GPU hardware and software have added support for other types of programming models, such as task-based approaches (see figure 1.25) and graph algorithms. These alternative programming models have long been an interest in parallel programming, but have struggled with efficiency and scale. There are critical applications, such as sparse matrix solvers, that cannot easily be implemented without further advances in these areas. But the fundamental question is whether enough parallelism can be exposed (revealed to the hardware) to utilize the massive parallel architecture of the GPUs. Only time will tell.
NVIDIA has long supported research into GPU programming. The CUDA C programming and best practices guides (available at https://docs.nvidia.com/cuda) are worth reading. Other resources include
The GPU Gems series (https://developer.nvidia.com/gpugems) is an older set of papers that still contains a lot of relevant materials.
AMD also has a lot of GPU programming materials at their GPUOpen site at
https://gpuopen.com/compute-product/rocm/
https://rocm.github.io/documentation.html
AMD provides one of the better tables comparing the terminology of the different GPU programming languages, available at https://rocm.github.io/languages.html.
Despite having about 65% of the GPU market (mostly integrated GPUs), Intel® is just beginning to be a serious player in GPU computation. They have announced a new discrete graphics board and will be the GPU vendor for the Aurora system at Argonne National Laboratory (to be delivered in 2022). The Aurora system will be the first exascale system ever produced, with 6x the performance of the current top system in the world. The GPU is based on the Intel® Iris® Xe architecture, code-named “Ponte Vecchio.” With much fanfare, Intel has released its oneAPI programming initiative. The oneAPI toolkit comes with the Intel GPU driver, compilers, and tools. Go to https://software.intel.com/oneapi for more information and downloads.
You have an image classification application that will take 5 ms to transfer each file to the GPU, 5 ms to process, and 5 ms to bring back. On the CPU, the processing takes 100 ms per image. There are one million images to process. You have 16 processing cores on the CPU. Would a GPU system do the work faster?
The transfer time for the GPU in problem 1 is based on a third generation PCI bus. If you can get a Gen4 PCI bus, how does that change the design? A Gen5 PCI bus? For image classification, you shouldn’t need to bring back a modified image. How does that change the calculation?
For your discrete GPU (or NVIDIA GeForce GTX 1060, if none), what size 3D application could you run? Assume 4 double-precision variables per cell and a usage limit of half the GPU memory so you have room for temporary arrays. How does this change if you use single precision?
Parallelism on the GPU needs to be in the thousands of independent work items because there are thousands of independent arithmetic units. The CPU only needs parallelism in the tens of independent work items to distribute work across the processing cores. Thus, for the GPU, it is important to expose more parallelism in our applications to keep the processing units busy.
Different GPU vendors have similar programming models driven by the needs of high-frame-rate graphics. Because of this, a general approach can be developed that is applicable across many different GPUs.
The GPU programming model is particularly well suited for data parallelism with large sets of computational data but can be difficult for some tasks with a lot of coordination, such as reductions. The result is that many highly parallel loops port easily, but there are some that take a lot of effort.
The separation of a computational loop into a loop body and the loop control, or index set, is a powerful concept for GPU programming. The loop body becomes the GPU kernel, and the CPU does the memory allocation and invokes the kernel.
Asynchronous work queues can overlap communication and computation. This can help to improve the utilization rate of the GPU.
There has been a scramble to establish standards for directive-based languages for programming GPUs. The pre-eminent directive-based language, OpenMP, released in 1997, was the natural candidate to look to as an easier way to program GPUs. At that time, OpenMP was playing catchup and mainly focused on new CPU capabilities. To address GPU accessibility, in 2011, a small group of compiler vendors (Cray, PGI, and CAPS), along with NVIDIA as the GPU vendor, joined to release the OpenACC standard, providing a simpler pathway to GPU programming. Similar to what you saw in chapter 7 for OpenMP, OpenACC also uses pragmas. In this case, OpenACC pragmas direct the compiler to generate GPU code. A couple of years later, the OpenMP Architecture Review Board (ARB) added their own pragma support for GPUs to the OpenMP standard.
We’ll work through some basic examples in OpenACC and OpenMP to give you an idea of how they work. We suggest that you try out the examples on your target system to see what compilers are available and their current status.
Note As always, we encourage you to follow along with the examples for this chapter at https://github.com/EssentialsofParallelComputing/Chapter11.
Many programmers find themselves “on the fence” in regard to which directive-based language—OpenACC or OpenMP—they should use. Often, the choice is clear once you find out what is available on your system of choice. Keep in mind that the biggest hurdle to overcome is simply to start. If you later decide to switch GPU languages, the preliminary work will still prove valuable as the core concepts transcend the language. We hope that by seeing how little effort is required to generate GPU code using pragmas and directives, you will be encouraged to try it on some of your code. You may even experience a modest speedup with just a little effort.
Directive or pragma-based annotations to C, C++, or Fortran applications provide one of the more attractive pathways to access the compute power of GPUs. Much like the OpenMP threading model covered in chapter 7, you can add just a few lines to your application and the compiler generates code that can run on the GPU or the CPU. As first covered in chapters 6 and 7, pragmas are preprocessor statements in C and C++ that give the compiler special instructions. These take the form
#pragma acc <directive> [clause]
#pragma omp <directive> [clause]
Directives in the form of special comments provide the corresponding capability for Fortran code. The directives start with the comment character, followed by either the acc or omp keyword to identify these as directives for OpenACC and OpenMP, respectively.
!$acc <directive> [clause]
!$omp <directive> [clause]
The same general steps are used for implementing OpenACC and OpenMP in applications. Figure 11.1 shows these steps and we’ll detail them in the following sections.
Figure 11.1 Steps to implement a GPU port with the pragma-based languages. Offloading the work to a GPU causes data transfers that slow down the application until the data movement is reduced.
We summarize the three steps that we will use to convert a code to run on the GPU with either OpenACC or OpenMP as follows:
Move the computationally intensive work to the GPU. This forces data transfers between the CPU and GPU that will slow down the code, but the work has to be moved first.
Reduce the data movement between the CPU and GPU. Move allocations to the GPU if the data is only used there.
Tune the size of the workgroup, number of workgroups, and other kernel parameters to improve kernel performance.
At this point, you will have an application running much faster on the GPU. Further optimizations are possible to improve performance, although these tend to be more specific for each application.
We’ll start with getting a simple application running with OpenACC. We do this to show the basic details of getting things working. Then we’ll work on how to optimize the application once it is running. As might be expected with a pragma-based approach, there is a large payoff for a small effort. But first, you have to work through the initial slowdown of the code. Don’t despair! It is normal to encounter an initial slowdown on your journey to faster computations on a GPU.
Often the most difficult step is getting a working OpenACC compiler toolchain. Several solid OpenACC compilers are available. The most notable of the available compilers are listed as follows:1
PGI—This is a commercial compiler, but note that PGI has a community edition for a free download.
GCC—Versions 7 and 8 implement most of the OpenACC 2.0a specification. Version 9 implements most of the OpenACC 2.5 specification. The OpenACC development branch in GCC is working on OpenACC 2.6, featuring further improvements and optimizations.
Cray—Another commercial compiler; it is only available on Cray systems. Cray has announced that they will no longer support OpenACC in their new LLVM-based C/C++ compiler as of version 9.0. A “classic” version of the compiler that supports OpenACC continues to be available.
For these examples, we’ll use the PGI compiler (version 19.7) and CUDA (version 10.1). The PGI compiler is the most mature option among the more readily available compilers. The GCC compiler is another option, but be sure to use the most recent version available. The Cray compiler is a great option if you have access to a Cray system.
Note What if you don’t have a suitable GPU? You can still try the examples by running the code on your CPU with the OpenACC generated kernels. Performance will be different, but the basic code should be the same.
With the PGI compiler, you can first get information on your system with the pgaccelinfo command. It also lets you know if your system and environment are in working order. After running the command, the output should look something like what is shown in figure 11.2.
Figure 11.2 Output from the pgaccelinfo command shows the type of GPU and its characteristics.
Listing 11.1 shows some excerpts from OpenACC makefiles. CMake provides the FindOpenACC.cmake module called in line 18 in the listing. The full CMakeLists.txt file is included in the supplemental source code for the chapter in the OpenACC/StreamTriad directory at https://github.com/EssentialsofParallelComputing/Chapter11. We set some flags for compiler feedback and for the compiler to be less conservative about potential aliasing. Both a CMake file and a simple makefile are provided in the subdirectory.
Listing 11.1 Excerpts from OpenACC makefiles
OpenACC/StreamTriad/CMakeLists.txt
8 if (NOT CMAKE_OPENACC_VERBOSE)
9 set(CMAKE_OPENACC_VERBOSE true)
10 endif (NOT CMAKE_OPENACC_VERBOSE)
11
12 if (CMAKE_C_COMPILER_ID MATCHES "PGI")
13 set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -alias=ansi")
14 elseif (CMAKE_C_COMPILER_ID MATCHES "GNU")
15 set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fstrict-aliasing")
16 endif (CMAKE_C_COMPILER_ID MATCHES "PGI")
17
18 find_package(OpenACC) ❶
19
20 if (CMAKE_C_COMPILER_ID MATCHES "PGI")
21 set(OpenACC_C_VERBOSE "${OpenACC_C_VERBOSE} -Minfo=accel")
22 elseif (CMAKE_C_COMPILER_ID MATCHES "GNU")
23 set(OpenACC_C_VERBOSE
"${OpenACC_C_VERBOSE} -fopt-info-optimized-omp")
24 endif (CMAKE_C_COMPILER_ID MATCHES "PGI")
25
26 if (CMAKE_OPENACC_VERBOSE) ❷
27 set(OpenACC_C_FLAGS
"${OpenACC_C_FLAGS} ${OpenACC_C_VERBOSE}") ❷
28 endif (CMAKE_OPENACC_VERBOSE) ❷
29
< ... skipping first target ... >
33 # Adds build target of stream_triad with source code files
34 add_executable(StreamTriad_par1 StreamTriad_par1.c timer.c timer.h)
35 set_source_files_properties(StreamTriad_par1.c PROPERTIES COMPILE_FLAGS
"${OpenACC_C_FLAGS}") ❸
36 set_target_properties(StreamTriad_par1 PROPERTIES LINK_FLAGS
"${OpenACC_C_FLAGS}") ❸
❶ CMake module sets compiler flags for OpenACC
❷ Adds compiler feedback for accelerator directives
❸ Adds OpenACC flags to compile and link stream triad source
The simple makefiles can also be used for building the example codes by copying or linking these over to a Makefile by using either of these commands:
ln -s Makefile.simple.pgi Makefile
cp Makefile.simple.pgi Makefile
From the makefiles for the PGI and GCC compilers, we show the suggested flags for OpenACC:
Makefile.simple.pgi
6 CFLAGS:= -g -O3 -c99 -alias=ansi -Mpreprocess -acc -Mcuda -Minfo=accel
7
8 %.o: %.c
9 ${CC} ${CFLAGS} -c $^
10
11 StreamTriad: StreamTriad.o timer.o
12 ${CC} ${CFLAGS} $^ -o StreamTriad
Makefile.simple.gcc
6 CFLAGS:= -g -O3 -std=gnu99 -fstrict-aliasing -fopenacc \
-fopt-info-optimized-omp
7
8 %.o: %.c
9 ${CC} ${CFLAGS} -c $^
10
11 StreamTriad: StreamTriad.o timer.o
12 ${CC} ${CFLAGS} $^ -o StreamTriad
For PGI, the flags to enable OpenACC compilation are -acc -Mcuda. The -Minfo=accel flag tells the compiler to provide feedback on accelerator directives. We also include the -alias=ansi flag to tell the compiler to be less concerned about pointer aliasing so that it can more freely generate parallel kernels. It is still a good idea to include the restrict attribute on arguments in your source code to tell the compiler that variables do not point to overlapping regions of memory. We also include a flag in both makefiles to set the C 1999 standard so that we can define loop index variables in the loop for clearer scoping. For GCC, the -fopenacc flag turns on the parsing of the OpenACC directives. The -fopt-info-optimized-omp flag tells the compiler to provide feedback on code generation for the accelerator.
For the Cray compiler, OpenACC is on by default. You can use the compiler option -hnoacc if you need to turn it off. OpenACC compilers must define the _OPENACC macro. The macro is particularly important because OpenACC is still in the process of being implemented by many compilers. You can use it to tell which version of OpenACC your compiler supports and to implement conditional compilation for newer features by comparing against the compiler macro _OPENACC == yyyymm, where the version dates are
There are two different options for declaring an accelerated block of code for computations. The first is the kernels pragma that gives the compiler freedom to auto-parallelize the code block. This code block can include larger sections of code with several loops. The second is the parallel loop pragma that tells the compiler to generate code for the GPU or other accelerator device. We’ll go over examples of each approach.
Using the kernels pragma to get auto-parallelization from the compiler
The kernels pragma allows auto-parallelization of a code block by the compiler. It is often used first to get feedback from the compiler on a section of code. We’ll cover the formal syntax for the kernels pragma, including its optional clauses. Then we’ll look at the stream triad example we used in all of our programming chapters and apply the kernels pragma. First, we’ll list the specification for the kernels pragma from the OpenACC 2.6 standard:
#pragma acc kernels [ data clause | kernel optimization | async clause |
conditional ]
data clauses - [ copy | copyin | copyout | create | no_create |
present | deviceptr | attach | default(none|present) ]
kernel optimization - [ num_gangs | num_workers | vector_length |
device_type | self ]
async clauses - [ async | wait ]
conditional - [ if ]
We’ll discuss the data clauses in more detail in section 11.2.3, although you can also use the data clauses in the kernel pragma if these only apply to a single loop. We’ll cover the kernel optimizations in section 11.2.4. And we’ll briefly mention the async and conditional clauses in section 11.2.5.
We first start by specifying where we want the work to be parallelized by adding #pragma acc kernels around the targeted blocks of code. The kernels pragma applies to the code block following the directive; for the code in the next listing, that is the for loop.
Listing 11.2 Adding the kernels pragma
OpenACC/StreamTriad/StreamTriad_kern1.c
1 #include <stdio.h>
2 #include <stdlib.h>
3 #include "timer.h"
4
5 int main(int argc, char *argv[]){
6
7 int nsize = 20000000, ntimes=16;
8 double* a = malloc(nsize * sizeof(double));
9 double* b = malloc(nsize * sizeof(double));
10 double* c = malloc(nsize * sizeof(double));
11
12 struct timespec tstart;
13 // initializing data and arrays
14 double scalar = 3.0, time_sum = 0.0;
15 #pragma acc kernels ❶
16 for (int i=0; i<nsize; i++) { ❷
17 a[i] = 1.0; ❷
18 b[i] = 2.0; ❷
19 } ❷
20
21 for (int k=0; k<ntimes; k++){
22 cpu_timer_start(&tstart);
23 // stream triad loop
24 #pragma acc kernels ❶
25 for (int i=0; i<nsize; i++){ ❷
26 c[i] = a[i] + scalar*b[i]; ❷
27 } ❷
28 time_sum += cpu_timer_stop(tstart);
29 }
30
31 printf("Average runtime for stream triad loop is %lf msecs\n",
time_sum/ntimes);
32
33 free(a);
34 free(b);
35 free(c);
36
37 return(0);
38 }
❶ Inserts OpenACC kernels pragma
❷ Code block for kernels pragma
The following output shows the feedback from the PGI compiler:
main:
15, Generating implicit copyout(b[:20000000],a[:20000000])
[if not already present]
16, Loop is parallelizable
Generating Tesla code
16, #pragma acc loop gang, vector(128)
/* blockIdx.x threadIdx.x */
16, Complex loop carried dependence of a-> prevents parallelization
Loop carried dependence of b-> prevents parallelization
24, Generating implicit copyout(c[:20000000]) [if not already present]
Generating implicit copyin(b[:20000000],a[:20000000])
[if not already present]
25, Complex loop carried dependence of a->,b-> prevents
parallelization
Loop carried dependence of c-> prevents parallelization
Loop carried backward dependence of c-> prevents vectorization
Accelerator serial kernel generated
Generating Tesla code
25, #pragma acc loop seq
25, Complex loop carried dependence of b-> prevents parallelization
Loop carried backward dependence of c-> prevents vectorization
What isn’t clear in this listing is that OpenACC treats each for loop as if it has a #pragma acc loop auto in front of it. We have left the decision to the compiler to decide whether it could parallelize the loop. The output in bold indicates that the compiler doesn’t think it can. The compiler is telling us it needs help. The simplest fix is to add a restrict attribute to lines 8-10 in listing 11.2.
8 double* restrict a = malloc(nsize * sizeof(double));
9 double* restrict b = malloc(nsize * sizeof(double));
10 double* restrict c = malloc(nsize * sizeof(double));
Our second choice of fix to help the compiler is to change the directive to tell the compiler it is OK to generate parallel GPU code. The problem is the default loop directive (loop auto), which we mentioned earlier. Here is the specification from the OpenACC 2.6 standard:
#pragma acc loop [ auto | independent | seq | collapse | gang | worker |
vector | tile | device_type | private | reduction ]
We cover many of these clauses in later sections. For now, we’ll focus on the first three: auto, independent, and seq.
auto, the default, leaves the decision of whether the loop can be parallelized to the compiler.
seq, short for sequential, says to generate a sequential version.
independent asserts that the loop can and should be parallelized.
Changing the clause from auto to independent tells the compiler to parallelize the loop:
15 #pragma acc kernels loop independent
<Skipping unchanged code>
24 #pragma acc kernels loop independent
Note that we have combined the two constructs in these directives. You can combine valid individual clauses into a single directive, if you like. Now the output shows that the loop is parallelized:
main:
15, Generating implicit copyout(a[:20000000],b[:20000000])
[if not already present]
16, Loop is parallelizable
Generating Tesla code
16, #pragma acc loop gang, vector(128)
/* blockIdx.x threadIdx.x */
24, Generating implicit copyout(c[:20000000]) [if not already present]
Generating implicit copyin(b[:20000000],a[:20000000])
[if not already present]
25, Loop is parallelizable
Generating Tesla code
25, #pragma acc loop gang, vector(128)
/* blockIdx.x threadIdx.x */
The important thing to note in this output is the feedback about the implicit data transfers (the copyin and copyout lines). We’ll discuss how to address this feedback in section 11.2.3.
Try the parallel loop pragma for more control over parallelization
Next we’ll cover how to use the parallel loop pragma. This is the technique we recommend that you use in your application. It is more consistent with the form used in other parallel languages, such as OpenMP, and it generates more consistent and portable performance across compilers. Not all compilers can be counted on to do an adequate job of the analysis that the kernels directive requires.
The parallel loop pragma is actually two separate directives. The first is the parallel directive, which opens a parallel region. The second is the loop pragma, which distributes the work across the parallel work elements. We’ll look at the parallel pragma first. The parallel pragma takes the same clauses as the kernels directive, plus the kernel-optimization clauses shown in the following example:
#pragma acc parallel [ clause ]
   data clauses - [ reduction | private | firstprivate | copy | copyin |
                    copyout | create | no_create | present | deviceptr |
                    attach | default(none|present) ]
   kernel optimization - [ num_gangs | num_workers | vector_length |
                           device_type | self ]
   async clauses - [ async | wait ]
   conditional - [ if ]
The clauses for the loop construct were mentioned earlier in the kernels section. The important thing to note is that the default for the loop construct in a parallel region is independent rather than auto. Again, as in the kernels directive, the combined parallel loop construct can take any clause that the individual directives can. With this explanation of the parallel loop construct, we move on to how it is added to the stream triad example as shown in the following listing.
Listing 11.3 Adding a parallel loop pragma
OpenACC/StreamTriad/StreamTriad_par1.c
12 struct timespec tstart;
13 // initializing data and arrays
14 double scalar = 3.0, time_sum = 0.0;
15 #pragma acc parallel loop                 ❶
16 for (int i=0; i<nsize; i++) {
17    a[i] = 1.0;
18    b[i] = 2.0;
19 }
20
21 for (int k=0; k<ntimes; k++){
22    cpu_timer_start(&tstart);
23    // stream triad loop
24 #pragma acc parallel loop                 ❶
25    for (int i=0; i<nsize; i++){
26       c[i] = a[i] + scalar*b[i];
27    }
28    time_sum += cpu_timer_stop(tstart);
29 }
❶ Inserts the parallel loop combined construct
The output from the PGI compiler is
main:
15, Generating Tesla code
16, #pragma acc loop gang, vector(128)
/* blockIdx.x threadIdx.x */
15, Generating implicit copyout(a[:20000000],b[:20000000])
[if not already present]
24, Generating Tesla code
25, #pragma acc loop gang, vector(128)
/* blockIdx.x threadIdx.x */
24, Generating implicit copyout(c[:20000000]) [if not already present]
Generating implicit copyin(b[:20000000],a[:20000000])
[if not already present]
Even without the restrict attribute, the loop is parallelized because the default for the loop directive is the independent clause. This is different than the default for the kernels directive that we saw previously. Still, we recommend that you use the restrict attribute in your code to help the compiler generate the best code.
The output is similar to that from the previous kernels directive. At this point, the performance of the code will likely have slowed down due to the data movement reported in the compiler output (the implicit copyin and copyout lines). Not to worry; we will speed it back up in the next step.
Before we move on to addressing the data movement, we’ll take a quick look at reductions and the serial construct. Listing 11.4 shows the mass sum example first introduced in section 6.3.3. The mass sum is a simple reduction operation. Instead of the OpenMP SIMD vectorization pragma, we placed an OpenACC parallel loop pragma with the reduction clause before the loop. The syntax of the reduction is familiar because it is the same as that used by the threaded OpenMP standard.
Listing 11.4 Adding a reduction clause
OpenACC/mass_sum/mass_sum.c
1 #include "mass_sum.h"
2 #define REAL_CELL 1
3
4 double mass_sum(int ncells, int* restrict celltype,
5 double* restrict H, double* restrict dx,
double* restrict dy){
6 double summer = 0.0;
7 #pragma acc parallel loop reduction(+:summer) ❶
8 for (int ic=0; ic<ncells ; ic++) {
9 if (celltype[ic] == REAL_CELL) {
10 summer += H[ic]*dx[ic]*dy[ic];
11 }
12 }
13 return(summer);
14 }
❶ Adds a reduction clause to a parallel loop construct
There are other operators that you can use in a reduction clause. These include *, max, min, &, |, &&, and ||. Through OpenACC version 2.6, the variable or comma-separated list of variables is limited to scalars; arrays are not allowed. OpenACC version 2.7 lets you use arrays and composite variables in the reduction clause.
The last construct we’ll cover in this section is the one for serial work. Some loops cannot be done in parallel. Rather than exit the parallel region, we stay within it and tell the compiler to just do this one part in serial. This is done with the serial directive:
#pragma acc serial
A code block with the serial directive is executed by one gang of one worker with a vector length of one. Now, let’s turn our attention to addressing the data movement feedback.
This section returns to a theme we have seen throughout this book. Data movement is more important than flops. Although we have sped up the computations by moving these to the GPU, the overall run time has slowed because of the cost of data movement. Addressing the excessive data movement will start yielding an overall speedup. To do this, we add the data construct to our code. In the OpenACC standard, v2.6, the specification for the data construct is as follows:
#pragma acc data [ copy | copyin | copyout | create | no_create | present |
                   deviceptr | attach | default(none|present) ]
You will also see references to clauses like present_or_copy or the shorthand pcopy that check for the presence of the data before making the copy. These are no longer necessary, though they are retained for backward compatibility. The standard clauses have incorporated this behavior beginning with version 2.5 of the OpenACC standard.
Many of the data clauses take an argument that lists the data to be copied or otherwise manipulated. The range specification for the array needs to be given to the compiler. An example of this is
#pragma acc data copy(x[0:nsize])
The range specification is subtly different for C/C++ and Fortran. In C/C++, the first argument in the specification is the start index, and the second is the length. In Fortran, the first argument is the start index, and the second argument is the end index.
There are two varieties of data regions. The first is the structured data region from the original OpenACC version 1.0 standard. The second, a dynamic data region, was introduced in version 2.0 of OpenACC. We’ll look at the structured data region first.
Structured data region for simple blocks of code
The structured data region is delimited by a code block. This can be a natural code block formed by a loop or a region of code contained within a set of curly braces. In Fortran, the region is marked with a starting directive and ends with an ending directive. Listing 11.5 shows an example of a structured data region that starts with the directive on line 16 and is delimited by the opening brace on line 17 and the ending brace on line 37. We have included a comment on the ending brace in the code to help identify the block of code that the brace ends.
Listing 11.5 Structured data block pragma
OpenACC/StreamTriad/StreamTriad_par2.c
16 #pragma acc data create(a[0:nsize],\                ❶
                           b[0:nsize],c[0:nsize])      ❶
17 {                                                   ❷
18
19 #pragma acc parallel loop present(a[0:nsize],\      ❸
                                     b[0:nsize])       ❸
20    for (int i=0; i<nsize; i++) {
21       a[i] = 1.0;
22       b[i] = 2.0;
23    }
24
25    for (int k=0; k<ntimes; k++){
26       cpu_timer_start(&tstart);
27       // stream triad loop
28 #pragma acc parallel loop present(a[0:nsize],\      ❸
                           b[0:nsize],c[0:nsize])      ❸
29       for (int i=0; i<nsize; i++){
30          c[i] = a[i] + scalar*b[i];
31       }
32       time_sum += cpu_timer_stop(tstart);
33    }
34
35    printf("Average runtime for stream triad loop is %lf msecs\n",
            time_sum/ntimes);
36
37 } //#pragma end acc data block(a[0:nsize],b[0:nsize],c[0:nsize]) ❹
❶ The data directive defines the structured data region.
❷ Opening brace marks the start of the data region
❸ The present directive tells the compiler that a copy is not needed.
❹ Closing brace marks the end of the data region
The structured data region specifies that the three arrays are to be created at the start of the data region. These will be destroyed at the end of the data region. The two parallel loops use the present clause to avoid data copies for the compute regions.
Dynamic data region for a more flexible data scoping
结构化数据区域最初由 OpenACC 使用,其中分配了内存,然后有一些循环,不适用于更复杂的程序。特别是,面向对象的代码中的内存分配发生在创建对象时。如何围绕具有这种程序结构的东西放置数据区域?
The structured data region in the original OpenACC standard assumes a simple pattern: memory is allocated, and then some loops operate on it. That pattern does not fit more complicated programs. In particular, memory allocations in object-oriented code occur when an object is created. How do you put a data region around something with this kind of program structure?
To address this problem, OpenACC v2.0 added dynamic (also called unstructured) data regions. This dynamic data region construct was specifically created for more complex data management scenarios, such as constructors and destructors in C++. Rather than using scoping braces to define the data region, the pragma has an enter and an exit clause:
#pragma acc enter data
#pragma acc exit data
For the exit data directive, there is an additional delete clause that we can use. This use of the enter/exit data directive is best done where allocations and deallocations occur. The enter data directive should be placed just after an allocation, and the exit data directive should be inserted just before the deallocation. This more naturally follows the existing data scope of variables in an application. Once you want higher performance than what can be achieved from the loop-level strategy, these dynamic data regions become important. With the larger scope of the dynamic data regions, there is a need for an additional directive to update data:
#pragma acc update [self(x) | device(x)]
The device argument specifies that the data on the device is to be updated. The self argument says to update the local data, which is usually the host version of the data.
Let’s look at an example using a dynamic data pragma in listing 11.6. The enter data directive is placed after the allocation at line 12. The exit data directive at line 35 is inserted before the deallocations. We suggest using dynamic data regions in preference to structured data regions in almost all but the simplest code.
Listing 11.6 Creating dynamic data regions
OpenACC/StreamTriad/StreamTriad_par3.c
 8 double* restrict a = malloc(nsize * sizeof(double));
 9 double* restrict b = malloc(nsize * sizeof(double));
10 double* restrict c = malloc(nsize * sizeof(double));
11
12 #pragma acc enter data create(a[0:nsize],\          ❶
                                 b[0:nsize],c[0:nsize]) ❶
13
14 struct timespec tstart;
15 // initializing data and arrays
16 double scalar = 3.0, time_sum = 0.0;
17 #pragma acc parallel loop present(a[0:nsize],b[0:nsize])
18 for (int i=0; i<nsize; i++) {
19    a[i] = 1.0;
20    b[i] = 2.0;
21 }
22
23 for (int k=0; k<ntimes; k++){
24    cpu_timer_start(&tstart);
25    // stream triad loop
26 #pragma acc parallel loop present(a[0:nsize],b[0:nsize],c[0:nsize])
27    for (int i=0; i<nsize; i++){
28       c[i] = a[i] + scalar*b[i];
29    }
30    time_sum += cpu_timer_stop(tstart);
31 }
32
33 printf("Average runtime for stream triad loop is %lf msecs\n",
         time_sum/ntimes);
34
35 #pragma acc exit data delete(a[0:nsize],\           ❷
                                b[0:nsize],c[0:nsize]) ❷
36
37 free(a);
38 free(b);
39 free(c);
❶ Starts the dynamic data region after memory allocation
❷ Ends the dynamic data region before memory deallocation
If you paid close attention to the previous listing, you will have noticed that the arrays a, b, and c are allocated on both the host and the device, but are only used on the device. In listing 11.7, we show one way to fix this by using the acc_malloc routine and then putting the deviceptr clause on the compute regions.
Listing 11.7 Allocating data only on the device
OpenACC/StreamTriad/StreamTriad_par4.c
1 #include <stdio.h>
2 #include <openacc.h>
3 #include "timer.h"
4
5 int main(int argc, char *argv[]){
6
7 int nsize = 20000000, ntimes = 16;
8 double* restrict a_d = ❶
acc_malloc(nsize * sizeof(double)); ❶
9 double* restrict b_d = ❶
acc_malloc(nsize * sizeof(double)); ❶
10 double* restrict c_d = ❶
acc_malloc(nsize * sizeof(double)); ❶
11
12 struct timespec tstart;
13 // initializing data and arrays
14 const double scalar = 3.0;
15 double time_sum = 0.0;
16 #pragma acc parallel loop deviceptr(a_d, b_d) ❷
17 for (int i=0; i<nsize; i++) {
18 a_d[i] = 1.0;
19 b_d[i] = 2.0;
20 }
21
22 for (int k=0; k<ntimes; k++){
23 cpu_timer_start(&tstart);
24 // stream triad loop
25 #pragma acc parallel loop deviceptr(a_d, b_d, ❷
c_d) ❷
26 for (int i=0; i<nsize; i++){
27 c_d[i] = a_d[i] + scalar*b_d[i];
28 }
29 time_sum += cpu_timer_stop(tstart);
30 }
31
32 printf("Average runtime for stream triad loop is %lf msecs\n",
time_sum/ntimes);
33
34 acc_free(a_d); ❸
35 acc_free(b_d); ❸
36 acc_free(c_d); ❸
37
38 return(0);
39 }
❶ Allocates memory on the device. _d indicates a device pointer.
❷ The deviceptr clause tells the compiler that memory is already on the device.
❸ Deallocates memory on the device
The output from the PGI compiler is now much shorter as shown here:
16 Generating Tesla code
17 #pragma acc loop gang, vector(128)
   /* blockIdx.x threadIdx.x */
25 Generating Tesla code
26 #pragma acc loop gang, vector(128)
   /* blockIdx.x threadIdx.x */
The data movement is eliminated, and the memory requirements on the host are reduced. We still have some output giving feedback on the generated kernel, which we will look at in section 11.2.4. This example (listing 11.7) works for 1D arrays. For 2D arrays, the deviceptr clause does not take a descriptor argument, so the kernel has to be changed to do its own 2D indexing in a 1D array.
When referencing data regions, you have available a rich set of data directives and data movement clauses that you can use to reduce unnecessary data movement. Still, there are more clauses and OpenACC functions that we have not covered that can be useful in specialized situations.
Generally, you will have a greater impact by getting more kernels running on the GPU and reducing the data movement than by optimizing the GPU kernels themselves. The OpenACC compiler does a good job at producing the kernels, and the potential gains from further optimizations will be small. Occasionally, you can help the compiler improve the performance of key kernels enough for that to be worth some effort.
In this section, we’ll cover the general strategies for these optimizations. First, we review the terminology used in the OpenACC standard. As figure 11.3 shows, OpenACC defines abstract levels of parallelism that apply over multiple hardware devices.
Figure 11.3 The hierarchy of the levels in OpenACC: gangs, workers, and vectors
OpenACC defines these levels of parallelism:
Gang—An independent work block that shares resources. A gang can also synchronize within the group but not across the groups. For GPUs, gangs can be mapped to CUDA thread blocks or OpenCL work groups.
Workers—A warp in CUDA or work items within a work group in OpenCL.
Vector—A SIMD vector on the CPU and a SIMT work group or warp on the GPU with contiguous memory references.
Some examples of setting the level of a particular loop directive follow:
#pragma acc parallel loop vector
#pragma acc parallel loop gang
#pragma acc parallel loop gang vector
The outer loop must be a gang loop, and the inner loop should be a vector loop. A worker loop can appear in between. A sequential (seq) loop can appear at any level.
For most current GPUs, the vector length should be set to a multiple of 32, so it is an integer multiple of the warp size. It should be no larger than the maximum threads per block, which is commonly around 1,024 on current GPUs (see the output from the pgaccelinfo command in figure 11.2). For the examples here, the PGI compiler sets the vector length to a reasonable value of 128. The value can be changed with the vector_length(x) clause.
In what scenario should you change the vector_length setting? If the inner loop of contiguous data is less than 128, part of the vector will go unused. In this case, reducing this value can be helpful. Another option would be to collapse a couple of the inner loops to get a longer vector as we will discuss shortly.
You can modify the worker setting with the num_workers clause; the examples in this chapter, however, do not use it. Even so, it can be useful to increase it when shortening the vector length or for an additional level of parallelization. If your code needs to synchronize within the parallel work group, you should use the worker level, though OpenACC does not provide the user with a synchronization directive. The worker level also shares resources such as cache and local memory.
The rest of the parallelization is done with gangs, which are the asynchronous parallel level. Lots of gangs are important on GPUs for hiding latency and for high occupancy. Generally, the compiler sets this to a large number, so there is no need for the user to override it. A num_gangs clause is available on the remote chance that you need to do this.
Many of these settings will only be appropriate for a particular piece of hardware. The device_type(type) before a clause restricts it to the specified device type. The device type setting stays active until the next device type clause is encountered. For example
1 #pragma acc parallel loop gang \
2 device_type(acc_device_nvidia) vector_length(256) \
3 device_type(acc_device_radeon) vector_length(64)
4 for (int j=0; j<jmax; j++){
5 #pragma acc loop vector
6 for (int i=0; i<imax; i++){
7 <work>
8 }
9 }
For a list of valid device types, look at the openacc.h header file; an excerpt from PGI v19.7 follows. Note that there is no acc_device_radeon in these lines from the openacc.h header file, so the PGI compiler does not support the AMD Radeon™ device. This means we need a C preprocessor ifdef around line 3 in the previous sample code to keep the PGI compiler from complaining.
Excerpt from openacc.h file for PGI
27 typedef enum{
28 acc_device_none = 0,
29 acc_device_default = 1,
30 acc_device_host = 2,
31 acc_device_not_host = 3,
32 acc_device_nvidia = 4,
33 acc_device_pgi_opencl = 7,
34 acc_device_nvidia_opencl = 8,
35 acc_device_opencl = 9,
36 acc_device_current = 10
37 } acc_device_t;
The syntax for the kernels directive is slightly different, with the parallel type applied to each loop directive individually and taking the int argument directly:
#pragma acc kernels loop gang
for (int j=0; j<jmax; j++){
#pragma acc loop vector(64)
for (int i=0; i<imax; i++){
<work>
}
}
Loops can be combined with the collapse(n) clause. This is especially useful if there are two small inner loops contiguously striding through data. Combining these allows you to use a longer vector length. The loops must be tightly nested.
Definition: Two or more loops are tightly nested when there are no extra statements between the for or do statements or between the ends of the loops.
An example of combining two loops in order to use a long vector is
#pragma acc parallel loop collapse(2) vector(32)
for (int j=0; j<8; j++){
for (int i=0; i<4; i++){
<work>
}
}
OpenACC v2.0 added a tile clause that you can use for optimization. You can either specify the tile size or use asterisks to let the compiler choose:
#pragma acc parallel loop tile(*,*)
for (int j=0; j<jmax; j++){
for (int i=0; i<imax; i++){
<work>
}
}
Now it is time to try out the various kernel optimizations. The stream triad example did not show any real benefits from our optimization attempts, so we will work with the stencil example used in many of the previous chapters.
The associated code for the stencil example for this chapter goes through the same first two steps of moving the computational loops to the GPU and then reducing the data movement. The stencil code also requires one additional change. On the CPU, we swap pointers at the end of the loop. On the GPU, in lines 45-50, we have to copy the new data back to the original array. The following listing takes up the stencil code example with these steps completed.
Listing 11.8 Stencil example with compute loops on the GPU and data motion optimized
OpenACC/Stencil/Stencil_par3.c
17 #pragma acc enter data create( \                  ❶
      x[0:jmax][0:imax], xnew[0:jmax][0:imax])       ❶
18
19 #pragma acc parallel loop present( \              ❷
      x[0:jmax][0:imax], xnew[0:jmax][0:imax])       ❷
20 for (int j = 0; j < jmax; j++){
21    for (int i = 0; i < imax; i++){
22       xnew[j][i] = 0.0;
23       x[j][i] = 5.0;
24    }
25 }
26
27 #pragma acc parallel loop present( \              ❷
      x[0:jmax][0:imax], xnew[0:jmax][0:imax])       ❷
28 for (int j = jmax/2 - 5; j < jmax/2 + 5; j++){
29    for (int i = imax/2 - 5; i < imax/2 - 1; i++){
30       x[j][i] = 400.0;
31    }
32 }
33
34 for (int iter = 0; iter < niter; iter+=nburst){
35
36    for (int ib = 0; ib < nburst; ib++){
37       cpu_timer_start(&tstart_cpu);
38       #pragma acc parallel loop present( \        ❷
            x[0:jmax][0:imax], xnew[0:jmax][0:imax]) ❷
39       for (int j = 1; j < jmax-1; j++){
40          for (int i = 1; i < imax-1; i++){
41             xnew[j][i]=(x[j][i]+x[j][i-1]+x[j][i+1]+
                           x[j-1][i]+x[j+1][i])/5.0;
42          }
43       }
44
45       #pragma acc parallel loop present( \        ❷
            x[0:jmax][0:imax], xnew[0:jmax][0:imax]) ❷
46       for (int j = 0; j < jmax; j++){
47          for (int i = 0; i < imax; i++){
48             x[j][i] = xnew[j][i];
49          }
50       }
51       cpu_time += cpu_timer_stop(tstart_cpu);
52    }
53
54    printf("Iter %d\n",iter+nburst);
55 }
56
57 #pragma acc exit data delete( \                   ❶
      x[0:jmax][0:imax], xnew[0:jmax][0:imax])       ❶
❶ Dynamic data region directives
❷ Asserts that the arrays are already present on the device
First, note that we are using the dynamic data region directives, so there are no braces wrapping the data region as we would see with the structured data region. The dynamic region begins the data region when it encounters the enter directive and ends when it reaches an exit directive, no matter what path occurs between the two directives. In this case, it is a straight line of execution from the enter to the exit directive. We’ll add the collapse clause to the parallel loop to reduce the overhead for the two loops. The following listing shows this change.
Listing 11.9 Stencil example with a collapse clause
OpenACC/Stencil/Stencil_par4.c
36 for (int ib = 0; ib < nburst; ib++){
37 cpu_timer_start(&tstart_cpu);
38 #pragma acc parallel loop collapse(2)\ ❶
39 present(x[0:jmax][0:imax], xnew[0:jmax][0:imax])
40 for (int j = 1; j < jmax-1; j++){
41 for (int i = 1; i < imax-1; i++){
42 xnew[j][i]=(x[j][i]+x[j][i-1]+x[j][i+1]+
x[j-1][i]+x[j+1][i])/5.0;
43 }
44 }
45 #pragma acc parallel loop collapse(2)\
46 present(x[0:jmax][0:imax], xnew[0:jmax][0:imax])
47 for (int j = 0; j < jmax; j++){
48 for (int i = 0; i < imax; i++){
49 x[j][i] = xnew[j][i];
50 }
51 }
52 cpu_time += cpu_timer_stop(tstart_cpu);
53
54 }
❶ Adds the collapse clause to the parallel loop directive
We can also try using the tile clause. We start out by letting the compiler determine the tile size as shown in lines 41 and 48 in the following listing.
Listing 11.10 Stencil example with a tile clause
OpenACC/Stencil/Stencil_par5.c
39 for (int ib = 0; ib < nburst; ib++){
40 cpu_timer_start(&tstart_cpu);
41 #pragma acc parallel loop tile(*,*) \ ❶
42 present(x[0:jmax][0:imax], xnew[0:jmax][0:imax])
43 for (int j = 1; j < jmax-1; j++){
44 for (int i = 1; i < imax-1; i++){
45 xnew[j][i]=(x[j][i]+x[j][i-1]+x[j][i+1]+
x[j-1][i]+x[j+1][i])/5.0;
46 }
47 }
48 #pragma acc parallel loop tile(*,*) \
49 present(x[0:jmax][0:imax], xnew[0:jmax][0:imax])
50 for (int j = 0; j < jmax; j++){
51 for (int i = 0; i < imax; i++){
52 x[j][i] = xnew[j][i];
53 }
54 }
55 cpu_time += cpu_timer_stop(tstart_cpu);
56
57 }
❶ Adds the tile clause to the parallel loop directive
The change in the run times from these optimizations is small relative to the improvement seen from the initial OpenACC implementation. Table 11.1 shows the results for the NVIDIA V100 GPU with the PGI compiler v19.7.
Table 11.1 Run times for the OpenACC stencil kernel optimizations
We tried changing the vector length to 64 or 256 and tried different tile sizes, but didn't see any improvement in the run times. More complex codes can gain more from kernel optimizations, but note that any specialization of parameters, such as the vector length, hurts portability across compilers and architectures.
Another target for optimization is to implement a pointer swap at the end of the loop. The pointer swap is used in the original CPU code as a fast way to get data back to the original array. The copy of the data back to the original array doubles the run time on the GPU. The difficulty in pragma-based languages is that the pointer swap in a parallel region requires swapping both the host and the device pointers at the same time.
The run-time performance during the conversion to the GPU shows the typical pattern. Moving the computational kernels over to the GPU results in a slowdown by about a factor of 3, as shown by the kernel 2 and parallel 1 implementations in table 11.2. In the kernel 1 case, the computational loop fails to parallelize; running sequentially on the GPU, it was even slower. Once the data movement was reduced in kernel 3 and parallel 2-4, the run times showed a 67x speedup. The particular type of data region didn't matter much for performance, but it might be important for enabling ports of additional loops in more complex codes.
Table 11.2 Run times from OpenACC stream triad kernel optimizations
Many other features in OpenACC are available to handle more complex code. We’ll cover these briefly so you know what capabilities are available.
Handling functions with the OpenACC routine directive
OpenACC v1.0 required that functions used in kernels be inlined. Version 2.0 added the routine directive, with two different versions, to make calling routines simpler. The two versions are
#pragma acc routine [gang | worker | vector | seq | bind | no_host |
device_type]
#pragma acc routine(name) [gang | worker | vector | seq | bind | no_host |
device_type]
In C and C++, the routine directive should appear immediately before a function prototype or definition. The named version can appear anywhere before the function is defined or used. In Fortran, the !$acc routine directive should appear within the function body itself or in an interface body.
Avoiding race conditions with OpenACC atomics
Many threaded routines have a shared variable that has to be updated by multiple threads. This programming construct is both a common performance bottleneck and a potential race condition. To handle this situation, OpenACC v2 provides atomics to allow only one thread to access a storage location at a time. The syntax and valid clauses for the atomic directive are
#pragma acc atomic [read | write | update | capture]
If you don’t specify a clause, the default is update. An example of the use of the atomic directive is
#pragma acc atomic
cnt++;
Asynchronous operations in OpenACC
Overlapping OpenACC operations can help improve performance. The proper term for overlapping operations is asynchronous. OpenACC provides these asynchronous operations with the async and wait clauses and directives. The async clause is added to a work or data directive with an optional integer argument:
#pragma acc parallel loop async([<integer>])
The wait can be either a directive or a clause added to a work or data directive. The following pseudo-code in listing 11.11 shows how you can use this to launch the calculations on the x-faces and y-faces of a computational mesh and then wait for the results to update the cell values for the next iteration.
Listing 11.11 Async wait example in OpenACC
for (int n = 0; n < ntimes; ) {
#pragma acc parallel loop async
<x face pass>
#pragma acc parallel loop async
<y face pass>
#pragma acc wait
#pragma acc parallel loop
<Update cell values from face fluxes>
}
Unified Memory to avoid managing data movement
Although unified memory is not currently part of the OpenACC standard, there are experimental developments with having the system manage memory movement. Such an experimental implementation of unified memory is available in CUDA and the PGI OpenACC compiler. Using the -ta=tesla:managed flag with the PGI compiler and recent NVIDIA GPUs, you can try out their unified memory implementation. While the coding is simplified, the performance impacts are still not known and will change as the compilers mature.
Interoperability with CUDA libraries or kernels
OpenACC provides several directives and functions to make it possible to interoperate with CUDA libraries. In calling libraries, it is necessary to tell the compiler to use the device pointers instead of host data. The host_data directive can be used for this purpose:
#pragma acc host_data use_device(x, y)
cublasDaxpy(n, 2.0, x, 1, y, 1);
We showed a similar example when we allocated memory using acc_malloc in listing 11.7. With acc_malloc or cudaMalloc, the pointer returned is already on the device. For this case, we used the deviceptr clause to pass the pointer to the data region.
One of the most common mistakes in programming GPUs in any language is confusing a device pointer and a host pointer. Try finding 86 Pike Place, San Francisco, when it is really 86 Pike Place, Seattle. The device pointer points to a different physical block of memory on the GPU hardware.
Figure 11.4 shows the three different operations we have covered to help you understand the differences. In the first case, the malloc routine returns a host pointer. The present clause converts this to a device pointer for the device kernel. In the second case, where we allocate memory on the device with acc_malloc or cudaMalloc, we are given a device pointer. We use the deviceptr clause to send it to the GPU without any changes. In the last case, we don’t have a pointer on the host at all. We have to use the host_data use_device(var) directive to retrieve the device pointer to the host. This is done so that we have a pointer to send back to the device in the argument list for the device function.
Figure 11.4 Is it a device pointer or a host pointer? One points to the GPU memory and the other to the CPU memory, respectively. OpenACC keeps a map between arrays in the two address spaces and provides routines for retrieving each.
It is good practice to append a _h or _d to pointers to clarify their valid context. In our examples, all pointers and arrays are assumed to be on the host except for those ending with _d, which is for any device pointer.
Managing multiple devices in OpenACC
Many current HPC systems already have multiple GPUs. We can also foresee that we will get nodes with different accelerators. The ability to manage which device we are using becomes more and more important. OpenACC gives us this capability through the following functions:
We have now covered as much of OpenACC as we can in a dozen pages. The skills we’ve shown you are enough to get you started on an implementation. There is a lot more functionality available in the OpenACC standard, but much of it is for more complex situations or low-level interfaces that are not necessary for entry-level applications.
The OpenMP accelerator capability is an exciting addition to the traditional threading model. In this section, we show you how to get started with these directives. We’ll use the same examples as we did for the OpenACC section 11.2. By the end of this section, you should have some idea of how the two similar languages compare and which might be the better choice for your application.
Where do OpenMP’s accelerator directives stand in comparison to OpenACC? The OpenMP implementations are notably less mature at this point, though rapidly improving. The currently available implementations for GPUs are as follows:
Cray was first with an OpenMP implementation targeting NVIDIA GPUs in 2015. Cray now supports OpenMP v4.5.
IBM fully supports OpenMP v4.5 on Power 9 processor and NVIDIA GPUs.
GCC v6+ can offload to AMD GPUs; v7+ can offload to NVIDIA GPUs.
The two most mature implementations, Cray and IBM, are available only on their respective systems. Unfortunately, not everyone has access to systems from these vendors, but there are more widely available compilers. Two of these compilers, Clang and GCC, are in the throes of development with marginal versions available now. Look out for new developments with these compilers. The examples in this section use the IBM® XL 16 compiler and CUDA v10.
We start with how to set up a build environment and compile an OpenMP code. CMake has an OpenMP module, but it does not have explicit support for the OpenMP accelerator directives. We include an OpenMPAccel module that calls the regular OpenMP module and adds the flags needed for the accelerator. It also checks the OpenMP version that is supported, and if it is not v4.0 or newer, it generates an error. This CMake module is included with the source code for the chapter.
Listing 11.12 shows excerpts from the main CMakeLists.txt file in this chapter. Feedback from most of the OpenMP compilers is weak right now, so setting the -DCMAKE_OPENMPACCEL flag for CMake will only have minimal benefit. We’ll leverage other tools in these examples to fill in the gap.
Listing 11.12 Excerpts from an OpenMPaccel makefile
OpenMP/StreamTriad/CMakeLists.txt
10 if (NOT CMAKE_OPENMPACCEL_VERBOSE)
11 set(CMAKE_OPENMPACCEL_VERBOSE true)
12 endif (NOT CMAKE_OPENMPACCEL_VERBOSE)
13
14 if (CMAKE_C_COMPILER_ID MATCHES "GNU")
15 set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fstrict-aliasing")
16 elseif (CMAKE_C_COMPILER_ID MATCHES "Clang")
17 set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fstrict-aliasing")
18 elseif (CMAKE_C_COMPILER_ID MATCHES "XL")
19 set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -qalias=ansi")
20 elseif (CMAKE_C_COMPILER_ID MATCHES "Cray")
21 set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -h restrict=a")
22 endif (CMAKE_C_COMPILER_ID MATCHES "GNU")
23
24 find_package(OpenMPAccel) ❶
25
26 if (CMAKE_C_COMPILER_ID MATCHES "XL")
27 set(OpenMPAccel_C_FLAGS ❷
"${OpenMPAccel_C_FLAGS} -qreport") ❷
28 elseif (CMAKE_C_COMPILER_ID MATCHES "GNU")
29 set(OpenMPAccel_C_FLAGS
"${OpenMPAccel_C_FLAGS} -fopt-info-omp") ❷
30 endif (CMAKE_C_COMPILER_ID MATCHES "XL")
31
32 if (CMAKE_OPENMPACCEL_VERBOSE)
33 set(OpenACC_C_FLAGS "${OpenACC_C_FLAGS} ${OpenACC_C_VERBOSE}")
34 endif (CMAKE_OPENMPACCEL_VERBOSE)
35
36 # Adds build target of stream_triad_par1 with source code files
37 add_executable(StreamTriad_par1 StreamTriad_par1.c timer.c timer.h)
38 set_target_properties(StreamTriad_par1 PROPERTIES
COMPILE_FLAGS ${OpenMPAccel_C_FLAGS}) ❸
39 set_target_properties(StreamTriad_par1 PROPERTIES
LINK_FLAGS "${OpenMPAccel_C_FLAGS}") ❸
❶ CMake module sets compiler flags for OpenMP accelerator devices.
❷ Adds compiler feedback for accelerator directives
❸ Adds OpenMP accelerator flags for compiling and linking of stream triad
The simple makefiles can also be used for building the example codes by copying or linking one of them to Makefile with either of the following:
ln -s Makefile.simple.xl Makefile
cp Makefile.simple.xl Makefile
The following code snippet shows the suggested flags for the OpenMP accelerator directives in the simple makefiles for the IBM XL and GCC compilers:
Makefile.simple.xl
6 CFLAGS:=-qthreaded -g -O3 -std=gnu99 -qalias=ansi -qhot -qsmp=omp \
-qoffload -qreport
7
8 %.o: %.c
9 ${CC} ${CFLAGS} -c $^
10
11 StreamTriad: StreamTriad.o timer.o
12 ${CC} ${CFLAGS} $^ -o StreamTriad
Makefile.simple.gcc
6 CFLAGS:= -g -O3 -std=gnu99 -fstrict-aliasing \
7 -fopenmp -foffload=nvptx-none -foffload=-lm -fopt-info-omp
8
9 %.o: %.c
10 ${CC} ${CFLAGS} -c $^
11
12 StreamTriad: StreamTriad.o timer.o
13 ${CC} ${CFLAGS} $^ -o StreamTriad
Now we need to generate parallel work on the GPU. The OpenMP device parallel abstractions are more complicated than we saw with OpenACC. But this can also provide more flexibility in scheduling work in the future. For now, you should preface each loop with this directive:
#pragma omp target teams distribute parallel for simd
This is a long, confusing directive. Let’s go over each of the parts as illustrated in figure 11.5. The first three clauses specify hardware resources:
Figure 11.5 The target, teams, and distribute directives enable more hardware resources. The parallel for simd directive spreads out the work within each workgroup.
The remaining three are the parallel work clauses. All three are necessary for portability because different compilers spread the work out in different ways.
For kernels with three nested loops, one way you can spread out the work is with the following:
k loop: #pragma omp target teams distribute
j loop:    #pragma omp parallel for
i loop:       #pragma omp simd
Each OpenMP compiler can spread out the work differently, thus requiring some variants of this scheme. The simd loop should be the inner loop across contiguous memory locations. Some simplification of this complexity is being introduced with the loop clause in OpenMP v5.0 as we will present in section 11.3.5. You can also add clauses to this directive:
private, firstprivate, lastprivate, shared, reduction, collapse,
dist_schedule
Many of these clauses are familiar from OpenACC and behave the same way. One of the major differences from OpenACC is the default way that data is handled when entering a parallel work region. OpenACC compilers generally move all necessary arrays to the device. For OpenMP, there are two possibilities:
Scalars and statically allocated arrays are moved onto the device by default before execution.
Data allocated on the heap needs to be explicitly copied to and from the device.
Let’s look at a simple example of adding a parallel work directive in listing 11.13. We use statically allocated arrays that behave as if they are allocated on the stack, although, because of their large size, the compiler might actually allocate the memory on the heap.
Listing 11.13 Adding OpenMP pragmas to parallelize work on the GPU
OpenMP/StreamTriad/StreamTriad_par1.c
6 int main(int argc, char *argv[]){
7
8 int nsize = 20000000, ntimes=16;
9 double a[nsize]; ❶
10 double b[nsize]; ❶
11 double c[nsize]; ❶
12
13 struct timespec tstart;
14 // initializing data and arrays
15 double scalar = 3.0, time_sum = 0.0;
16 #pragma omp target teams distribute parallel for simd
17 for (int i=0; i<nsize; i++) {
18 a[i] = 1.0;
19 b[i] = 2.0;
20 }
21
22 for (int k=0; k<ntimes; k++){
23 cpu_timer_start(&tstart);
24 // stream triad loop
25 #pragma omp target teams distribute parallel for simd
26 for (int i=0; i<nsize; i++){
27 c[i] = a[i] + scalar*b[i];
28 }
29 time_sum += cpu_timer_stop(tstart);
30 }
31
32 printf("Average runtime for stream triad loop is %lf secs\n",
time_sum/ntimes);
❶ Allocating static arrays on the host
The feedback from the IBM XL compiler shows that the two kernels are offloaded to the GPU, but no other information is proffered. GCC gives no feedback at all. The IBM XL output is
"" 1586-672 (I) GPU OpenMP Runtime elided for offloaded kernel
'__xl_main_l15_OL_1'
"" 1586-672 (I) GPU OpenMP Runtime elided for offloaded kernel
'__xl_main_l23_OL_2'
To get some information on what the IBM XL compiler has done, we’ll use the NVIDIA profiler:
nvprof ./StreamTriad_par1
The first part of the output is
From this output, we now know that there is a memory copy from the host to the device (HtoD in the output) and then back from the device to the host (DtoH in the output). The nvprof output from GCC is similar but without line numbers. More detail about the order in which the operations occur can be obtained with the following:
nvprof --print-gpu-trace ./StreamTriad_par1
Most programs are not written with statically allocated arrays. Let’s take a look at a more commonly found case where the arrays are dynamically allocated as the following listing shows.
Listing 11.14 Parallel work directive with arrays dynamically allocated
OpenMP/StreamTriad/StreamTriad_par2.c
9 double* restrict a =
malloc(nsize * sizeof(double)); ❶
10 double* restrict b =
malloc(nsize * sizeof(double)); ❶
11 double* restrict c =
malloc(nsize * sizeof(double)); ❶
12
13 struct timespec tstart;
14 // initializing data and arrays
15 double scalar = 3.0, time_sum = 0.0;
16 #pragma omp target teams distribute \ ❷
parallel for simd \ ❷
17 map(a[0:nsize], b[0:nsize],
c[0:nsize]) ❷
18 for (int i=0; i<nsize; i++) {
19 a[i] = 1.0;
20 b[i] = 2.0;
21 }
22
23 for (int k=0; k<ntimes; k++){
24 cpu_timer_start(&tstart);
25 // stream triad loop
26 #pragma omp target teams distribute \ ❷
parallel for simd \ ❷
27 map(a[0:nsize], b[0:nsize],
c[0:nsize]) ❷
28 for (int i=0; i<nsize; i++){
29 c[i] = a[i] + scalar*b[i];
30 }
31 time_sum += cpu_timer_stop(tstart);
32 }
33
34 printf("Average runtime for stream triad loop is %lf secs\n",
time_sum/ntimes);
35
36 free(a);
37 free(b);
38 free(c);
❶ Dynamically allocated memory
❷ Parallel work directive for heap allocated memory
Note that lines 16 and 26 have added the map clause. If you try the directive without this clause, although it compiles fine with the IBM XLC compiler, at run time you’ll get this message:
1587-164 Encountered a zero-length array section that points to memory starting at address 0x200020000010. Because this memory is not currently mapped on the target device 0, a NULL pointer will be passed to the device.
1587-175 The underlying GPU runtime reported the following error "an illegal memory access was encountered".
1587-163 Error encountered while attempting to execute on the target device 0. The program will stop.
The GCC compiler, however, both compiles and runs fine without the map clause. Thus, the GCC compiler moves heap-allocated memory over to the device while IBM XLC does not. For portability, we should include the map clause in our application code.
OpenMP also has the reduction clause for the parallel work-region directives. The syntax is similar to that for the threaded OpenMP work directives and OpenACC. An example of the directive is as follows:
#pragma omp teams distribute parallel for simd reduction(+:sum)
Now that we have the work moved over to the GPU, we can add data regions to manage the data movement to and from the GPU. The data movement directives in OpenMP are similar to those in OpenACC with both a structured and dynamic version. The form of the directive is
#pragma omp target data [ map() | use_device_ptr() ]
The work directives are wrapped in the structured data region as listing 11.15 shows. The data is copied over to the GPU, if not already there. The data is then maintained there until the end of the block (at line 35) and copied back. This greatly reduces the data transfers for every parallel work loop and should result in a net speedup in the overall application run time.
Listing 11.15 Adding OpenMP pragmas to create a structured data region on the GPU
OpenMP/StreamTriad/StreamTriad_par3.c
17 #pragma omp target data map(to:a[0:nsize], \   ❶
         b[0:nsize], c[0:nsize])                  ❶
18 {                                              ❶
19 #pragma omp target teams distribute \          ❷
         parallel for simd                        ❷
20    for (int i=0; i<nsize; i++) {
21       a[i] = 1.0;
22       b[i] = 2.0;
23    }
24
25    for (int k=0; k<ntimes; k++){
26       cpu_timer_start(&tstart);
27       // stream triad loop
28 #pragma omp target teams distribute \          ❷
         parallel for simd                        ❷
29       for (int i=0; i<nsize; i++){
30          c[i] = a[i] + scalar*b[i];
31       }
32       time_sum += cpu_timer_stop(tstart);
33    }
34
35 }                                              ❶
❶ Structured data region directive
❷ Parallel work directive
Structured data regions cannot handle more general-programming patterns. Both OpenACC and OpenMP (version 4.5) added dynamic data regions, often referred to as unstructured data regions. The form for the directive has enter and exit clauses with a map modifier to specify the data transfer operation (such as the defaults to and from):
#pragma omp target enter data map([alloc | to]:array[[start]:[length]])
#pragma omp target exit data map([from | release | delete]:array[[start]:[length]])
In listing 11.16, we convert the omp target data directive to an omp target enter data directive (line 13). The scope of the data on the GPU concludes when it encounters an omp target exit data directive (line 36). The effect of these directives is the same as the structured data region in listing 11.15, but the dynamic data region can be used in more complex data management scenarios like constructors and destructors in C++.
Listing 11.16 Using a dynamic OpenMP data region
OpenMP/StreamTriad/StreamTriad_par4.c
13 #pragma omp target enter data \                ❶
      map(to:a[0:nsize], b[0:nsize], c[0:nsize]) ❶
14
15 struct timespec tstart;
16 // initializing data and arrays
17 double scalar = 3.0, time_sum = 0.0;
18 #pragma omp target teams distribute \          ❷
      parallel for simd                           ❷
19 for (int i=0; i<nsize; i++) {
20    a[i] = 1.0;
21    b[i] = 2.0;
22 }
23
24 for (int k=0; k<ntimes; k++){
25    cpu_timer_start(&tstart);
26    // stream triad loop
27 #pragma omp target teams distribute \          ❷
      parallel for simd                           ❷
28    for (int i=0; i<nsize; i++){
29       c[i] = a[i] + scalar*b[i];
30    }
31    time_sum += cpu_timer_stop(tstart);
32 }
33
34 printf("Average runtime for stream triad loop is %lf msecs\n",
      time_sum/ntimes);
35
36 #pragma omp target exit data \                 ❸
      map(from:a[0:nsize], b[0:nsize], c[0:nsize]) ❸
❶ Starts dynamic data region directive
❷ Parallel work directive
❸ Ends dynamic data region directive
We can further optimize the data transfers by allocating on the device and deleting the arrays on exit from the data region, thereby eliminating another data transfer. When transfers are needed to move data back and forth from the CPU and the GPU, you can use the omp target update directive. The syntax for the directive is
#pragma omp target update [to | from] (array[start:length])
We should also recognize that in this example the CPU never uses the array memory. For memory that only exists on the GPU, we can allocate it there and then tell the parallel work regions that it is already there. There are a couple of ways we can do this. One is to use an OpenMP function call to allocate and free memory on the device. These calls look like the following and require the inclusion of the OpenMP header file:
#include <omp.h>
double *a = omp_target_alloc(nsize*sizeof(double), omp_get_default_device());
omp_target_free(a, omp_get_default_device());
We could also use the CUDA memory allocation routines. We need to include the CUDA run-time header file to use these routines:
#include <cuda_runtime.h>
cudaMalloc((void **)&a, nsize*sizeof(double));
cudaFree(a);
On the parallel work directives, we then need to add another clause to pass the device pointers to the kernels on the device:
#pragma omp target teams distribute parallel for is_device_ptr(a)
Putting this all together, we end up with the changes to the code shown in the following listing.
Listing 11.17 Creating arrays only on the GPU
OpenMP/StreamTriad/StreamTriad_par6.c
11 double *a = omp_target_alloc(nsize*sizeof(double),
      omp_get_default_device());
12 double *b = omp_target_alloc(nsize*sizeof(double),
      omp_get_default_device());
13 double *c = omp_target_alloc(nsize*sizeof(double),
      omp_get_default_device());
14
15 struct timespec tstart;
16 // initializing data and arrays
17 double scalar = 3.0, time_sum = 0.0;
18 #pragma omp target teams distribute \
      parallel for simd is_device_ptr(a, b, c)
19 for (int i=0; i<nsize; i++) {
20    a[i] = 1.0;
21    b[i] = 2.0;
22 }
23
24 for (int k=0; k<ntimes; k++){
25    cpu_timer_start(&tstart);
26    // stream triad loop
27 #pragma omp target teams distribute \
      parallel for simd is_device_ptr(a, b, c)
28    for (int i=0; i<nsize; i++){
29       c[i] = a[i] + scalar*b[i];
30    }
31    time_sum += cpu_timer_stop(tstart);
32 }
33
34 printf("Average runtime for stream triad loop is %lf msecs\n",
      time_sum/ntimes);
35
36 omp_target_free(a, omp_get_default_device());
37 omp_target_free(b, omp_get_default_device());
38 omp_target_free(c, omp_get_default_device());
OpenMP has another way to allocate data on the device. This method uses the omp declare target directive as shown in listing 11.18. We first declare the pointers to the array on lines 10-12 and then allocate these on the device with the following block of code (lines 14-19). A similar block is used on lines 42-47 for freeing the data on the device.
Listing 11.18 Using omp declare to create arrays only on the GPU
OpenMP/StreamTriad/StreamTriad_par8.c
10 #pragma omp declare target            ❶
11 double *a, *b, *c;                    ❶
12 #pragma omp end declare target        ❶
13
14 #pragma omp target                    ❷
15 {                                     ❷
16    a = malloc(nsize * sizeof(double)); ❷
17    b = malloc(nsize * sizeof(double)); ❷
18    c = malloc(nsize * sizeof(double)); ❷
19 }                                     ❷
   < unchanged code >
42 #pragma omp target                    ❸
43 {                                     ❸
44    free(a);                           ❸
45    free(b);                           ❸
46    free(c);                           ❸
47 }                                     ❸
❶ declare target creates the pointers on the device
❷ Allocates data on the device
❸ Frees data on the device
As we have seen, there are many different options for data management on the GPU. We have now covered the most common data region directives and clauses in OpenMP. Recent additions to the OpenMP standard handle more complicated data structures and data transfers.
Let’s switch to a stencil example for the kernel optimization like we did for OpenACC. There are a few things you can try for speeding up individual kernels, but for the most part, it is best to let the compiler do the optimization for portability reasons. The core part of the stencil kernel with the OpenMP data and work regions in the following listing is the starting point for the optimization work.
Listing 11.19 Initial OpenMP version of stencil
OpenMP/Stencil/Stencil_par2.c
15 double** restrict x = malloc2D(jmax, imax);
16 double** restrict xnew = malloc2D(jmax, imax);
17
18 #pragma omp target enter data \               ❶
      map(to:x[0:jmax][0:imax], \                ❶
          xnew[0:jmax][0:imax])                  ❶
19
20 #pragma omp target teams                      ❷
21 {                                             ❷
22 #pragma omp distribute parallel for simd      ❷
23    for (int j = 0; j < jmax; j++){
24       for (int i = 0; i < imax; i++){
25          xnew[j][i] = 0.0;
26          x[j][i] = 5.0;
27       }
28    }
29
30 #pragma omp distribute parallel for simd      ❷
31    for (int j = jmax/2 - 5; j < jmax/2 + 5; j++){
32       for (int i = imax/2 - 5; i < imax/2 -1; i++){
33          x[j][i] = 400.0;
34       }
35    }
36 } // omp target teams                         ❷
37
38 for (int iter = 0; iter < niter; iter+=nburst){
39
40    for (int ib = 0; ib < nburst; ib++){
41       cpu_timer_start(&tstart_cpu);
42 #pragma omp target teams distribute \         ❷
         parallel for simd                       ❷
43       for (int j = 1; j < jmax-1; j++){       ❸
44          for (int i = 1; i < imax-1; i++){    ❸
45             xnew[j][i]=(x[j][i]+x[j][i-1]+x[j][i+1]+
                  x[j-1][i]+x[j+1][i])/5.0;      ❸
46          }                                    ❸
47       }                                       ❸
48
49 #pragma omp target teams distribute \         ❷
         parallel for simd                       ❷
50       for (int j = 0; j < jmax; j++){         ❹
51          for (int i = 0; i < imax; i++){      ❹
52             x[j][i] = xnew[j][i];             ❹
53          }                                    ❹
54       }                                       ❹
55       cpu_time += cpu_timer_stop(tstart_cpu);
56
57    }
58
59    printf("Iter %d\n",iter+nburst);
60 }
61
62 #pragma omp target exit data \                ❺
      map(from:x[0:jmax][0:imax], \              ❺
          xnew[0:jmax][0:imax])                  ❺
63
64 free(x);
65 free(xnew);
❶ Starts dynamic data region directive
❷ Parallel work directives
❸ Stencil operator
❹ Replaces swap with copy from new back to original
❺ Ends dynamic data region directive
Simply adding a single work directive for the 2D loop and the data construct is not enough to get the work efficiently generated for the GPU for version 16 of the IBM XL compiler. The run time is nearly twice as long as the serial version (see table 11.4 at the end of this section). You can use nvprof to find where the time is being spent. Here’s the output:
The first line shows that the third kernel is taking up more than 50% of the run time. The copy back to the original array is taking an additional 48% of the run time. It’s the kernel code and not the data transfer that is causing the problem! To correct this, the first thing to try is to collapse the two nested loops into a single parallel construct. The changes for this include adding the collapse clause along with the number of loops to collapse on the work directives. This is shown on lines 22, 30, 42, and 49 in the next listing.
Listing 11.20 Using collapse for optimization
OpenMP/Stencil/Stencil_par3.c
20 #pragma omp target teams
21 {
22 #pragma omp distribute parallel \             ❶
      for simd collapse(2)                       ❶
23    for (int j = 0; j < jmax; j++){
24       for (int i = 0; i < imax; i++){
25          xnew[j][i] = 0.0;
26          x[j][i] = 5.0;
27       }
28    }
29
30 #pragma omp distribute parallel \             ❶
      for simd collapse(2)                       ❶
31    for (int j = jmax/2 - 5; j < jmax/2 + 5; j++){
32       for (int i = imax/2 - 5; i < imax/2 -1; i++){
33          x[j][i] = 400.0;
34       }
35    }
36 }
37
38 for (int iter = 0; iter < niter; iter+=nburst){
39
40    for (int ib = 0; ib < nburst; ib++){
41       cpu_timer_start(&tstart_cpu);
42 #pragma omp target teams distribute \         ❶
         parallel for simd collapse(2)           ❶
43       for (int j = 1; j < jmax-1; j++){
44          for (int i = 1; i < imax-1; i++){
45             xnew[j][i]=(x[j][i]+x[j][i-1]+x[j][i+1]+
                  x[j-1][i]+x[j+1][i])/5.0;
46          }
47       }
48
49 #pragma omp target teams distribute \         ❶
         parallel for simd collapse(2)           ❶
50       for (int j = 0; j < jmax; j++){
51          for (int i = 0; i < imax; i++){
52             x[j][i] = xnew[j][i];
53          }
54       }
55       cpu_time += cpu_timer_stop(tstart_cpu);
56
57    }
58
59    printf("Iter %d\n",iter+nburst);
60 }
The run time is now faster than the CPU (see table 11.3), though not as fast as the version generated by the PGI OpenACC compiler (table 11.1). We expect that as the IBM XL compiler improves, this should get better. Let’s try another approach of splitting the parallel work directives across the two loops as shown in the following listing.
Listing 11.21 Splitting work directives for optimization
OpenMP/Stencil/Stencil_par4.c
20 #pragma omp target teams
21 {
22 #pragma omp distribute                        ❶
23    for (int j = 0; j < jmax; j++){
24 #pragma omp parallel for simd                 ❶
25       for (int i = 0; i < imax; i++){
26          xnew[j][i] = 0.0;
27          x[j][i] = 5.0;
28       }
29    }
30
31 #pragma omp distribute                        ❶
32    for (int j = jmax/2 - 5; j < jmax/2 + 5; j++){
33 #pragma omp parallel for simd                 ❶
34       for (int i = imax/2 - 5; i < imax/2 -1; i++){
35          x[j][i] = 400.0;
36       }
37    }
38 }
39
40 for (int iter = 0; iter < niter; iter+=nburst){
41
42    for (int ib = 0; ib < nburst; ib++){
43       cpu_timer_start(&tstart_cpu);
44 #pragma omp target teams distribute           ❶
45       for (int j = 1; j < jmax-1; j++){
46 #pragma omp parallel for simd                 ❶
47          for (int i = 1; i < imax-1; i++){
48             xnew[j][i]=(x[j][i]+x[j][i-1]+x[j][i+1]+
                  x[j-1][i]+x[j+1][i])/5.0;
49          }
50       }
51
52 #pragma omp target teams distribute           ❶
53       for (int j = 0; j < jmax; j++){
54 #pragma omp parallel for simd                 ❶
55          for (int i = 0; i < imax; i++){
56             x[j][i] = xnew[j][i];
57          }
58       }
59       cpu_time += cpu_timer_stop(tstart_cpu);
60
61    }
62
63    printf("Iter %d\n",iter+nburst);
64 }
❶ Splits work over two loop levels
The timing from the IBM XL compiler for the split parallel work directives is similar to the collapse clause. Table 11.3 shows the results of our experiments with kernel optimizations.
Table 11.3 Run times from OpenMP stencil kernel optimizations
We also look at the run time results for the stream triad example from the IBM XL compiler v16 on a Power 9 processor with an NVIDIA V100 GPU in table 11.4. The performance on the CPU is different because, in one case, we used an Intel Skylake processor and, in this case, we are using a Power 9 processor. But it is encouraging to see that the performance of the stream kernel with OpenMP on the V100 GPU is essentially the same as that for the PGI OpenACC compiler in table 11.2.
Table 11.4 Run times from OpenMP stream triad kernel optimizations
The performance of OpenMP with the IBM XL compiler is good on a simple 1D test problem but could be improved for the 2D stencil case. The focus thus far has been on correctly implementing the OpenMP standard for device offloading. We expect that performance will improve with each compiler release and with more compiler vendors offering OpenMP device offloading support.
OpenMP has many additional advanced capabilities. OpenMP is also changing based on the experience with the early implementations on GPUs and as hardware continues to evolve. We’ll cover just a few of the advanced directives and clauses that are important for
Handling various important programming constructs (functions, scans, and shared access to variables)
Asynchronous operations that overlap data movement and computation
Controlling the GPU kernel parameters implemented by the OpenMP compiler
We start by looking at clauses that can be used to fine-tune kernel performance. We can add these clauses to directives to modify the kernels that the compiler generates for the GPU:
num_teams defines the number of teams generated by the teams directive.
schedule or schedule(static,1) specifies that the work items are distributed in a round-robin manner rather than in a block. This can help with memory load coalescing on the GPU.
simdlen specifies the vector length or threads for the workgroup.
These clauses can be useful in special situations, but in general, it is better to leave the parameters for the compiler to optimize.
Declaring an OpenMP device function
When we call a function within a parallel region on the device, we need a way to tell the compiler that the function should also be compiled for the device. This is done by adding a declare target directive to the function. The syntax is similar to that for variable declarations. Here is an example:
#pragma omp declare target
int my_compute(<args>){
   <work>
}
#pragma omp end declare target
We discussed the importance of the scan algorithm in section 5.6, where we also saw the complexity of implementing this algorithm on the GPU. It is a ubiquitous operation in parallel computing and complicated to write, so the addition of this capability is helpful. The scan directive will be available in version 5.0 of OpenMP.
int run_sum = 0;
#pragma omp parallel for simd reduction(inscan,+: run_sum)
for (int i = 0; i < n; ++i) {
   run_sum += ncells[i];
   #pragma omp scan exclusive(run_sum)
   cell_start[i] = run_sum;
   #pragma omp scan inclusive(run_sum)
   cell_end[i] = run_sum;
}
Preventing race conditions with OpenMP Atomic
It is common in an algorithm for several threads to access a shared variable, and that access is often a performance bottleneck. Atomic operations, long provided by various compilers and threading implementations, make such accesses safe. OpenMP also provides an atomic directive. An example of its use is
#pragma omp atomic
   i++;
OpenMP’s version of asynchronous operations
In section 10.5, we discussed the value of overlapping data transfer and computation through asynchronous operations. OpenMP also provides its own version of these operations.
You create asynchronous device operations using the nowait clause on either a data or work directive. You can then use a depend clause to specify that a new operation cannot start until the previous operation is complete. These operations can be chained to form a sequence of operations. We can use a simple taskwait directive to wait for completion of all tasks:
#pragma omp taskwait
Accessing special memory spaces
Memory bandwidth is often one of the most important performance limits. With pragma-based languages, it has not always been possible to control the placement of memory and the resulting memory bandwidth. The addition of features to give the programmer more control over this has been one of the more eagerly anticipated additions to OpenMP. With OpenMP 5.0, you will be able to target special memory spaces such as shared memory and high-bandwidth memory. The capability is through a new allocator clause modifier. The allocate clause takes an optional modifier as follows:
allocate([allocator:] list)
You can use the following pair of functions to directly allocate and free memory:
omp_alloc(size_t size, omp_allocator_t *allocator)
omp_free(void *ptr, const omp_allocator_t *allocator)
The OpenMP 5.0 standard specifies some predefined memory spaces for allocators as this table shows.
A set of functions is available to define new memory allocators. The two main routines are
omp_init_allocator
omp_destroy_allocator
These allocators take one of the predefined space arguments and allocator traits such as whether it should be pinned, aligned, private, nearby, or many others. Implementations of this capability are still under development. This functionality will be of increasing importance with new architectures, where there are special memory types with different latency and bandwidth performance characteristics.
Deep copy support for transferring complex data structures
OpenMP 5.0 also adds a declare mapper construct that can do deep copies. Deep copies not only duplicate a data structure with pointers but also the data referred to by the pointers. Programs with complex data structures and classes have struggled with the difficulty of porting to GPUs. The ability to do deep copies greatly simplifies these implementations.
Simplifying work distribution with the new loop directive
The OpenMP 5.0 standard introduces more flexible work directives. One of these is the loop directive that is simpler and closer to the functionality in OpenACC. The loop directive takes the place of distribute parallel for simd. With the loop directive, you are telling the compiler that the loop iterations can be executed concurrently, but you leave the actual implementation to the compiler. The following listing shows an example of using this directive in the stencil kernel.
Listing 11.22 Using the new loop directive in OpenMP 5.0
#pragma omp target teams                        ❶
#pragma omp loop                                ❷
for (int j = 1; j < jmax-1; j++){
   #pragma omp loop                             ❷
   for (int i = 1; i < imax-1; i++){
      xnew[j][i]=(x[j][i]+x[j][i-1]+x[j][i+1]+
                  x[j-1][i]+x[j+1][i])/5.0;
   }
}
❶ Launches work on the GPU with multiple teams
❷ The loop parallelized as independent work
The loop clause is really a loop independent or concurrent clause that tells the compiler that iterations of the loop have no dependencies. The loop clause gives the compiler information or a descriptive clause rather than telling the compiler what to do, which is a prescriptive clause. Most compilers have not implemented this new feature, so we continue to work with the prescriptive clauses in the earlier examples in this chapter. If you’re not familiar with these concepts, here’s a definition of each:
Prescriptive directives and clauses—Directives from the programmer that tell the compiler specifically what to do.
Descriptive directives and clauses—Directives that give the compiler information about the following loop construct; they also give the compiler some freedom to generate the most efficient implementation.
OpenMP has traditionally used prescriptive clauses in its specifications. This reduces the variation between implementations and improves portability. But in the case of GPUs, it has led to long, complex directives with subtle distinctions about whether synchronization is possible between threads and other hardware-specific features.
The descriptive approach is closer to the OpenACC philosophy and is not as burdened with the details of the hardware. This gives the compiler both the freedom and the responsibility to properly and effectively generate code for the targeted hardware. Note that this is not only a significant shift for OpenMP, but an important one. If OpenMP continues to try to go down the path of prescriptive directives as hardware complexity grows, the OpenMP language will become too complicated, and the portability of codes will be reduced.
Both OpenACC and OpenMP are large languages with many directives, clauses, modifiers, and functions. Beyond the core functionality of these languages, there are few examples and sparse documentation. Indeed, many of the lesser-used parts may not work in all compilers. You should test new functionality in a small example before adding it to a large application. To learn more about these languages, refer to the additional reading materials that follow. Also, be sure to get some hands-on experience with the exercises in section 11.4.2.
Because the OpenACC and OpenMP languages are still evolving, the best sources for additional materials are at the respective websites: https://openacc.org and https://openmp.org. Each site lists additional resources, including tutorials and presentations at leading HPC conferences.
OpenACC resources and references
OpenACC has been out a little longer than OpenMP and has more books and documentation. The starting place for the language is the OpenACC standard. At 150 pages, version 3.0 of the standard is very readable and relevant to the end user. It can be found on the openacc.org website. The following URL provides a link to The OpenACC Application Programming Interface, v3.0 (November, 2018):
https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC.3.0.pdf.
The OpenACC site also has a document on programming and best practices. It is not linked to a particular version of the standard, but has not been updated since 2015. You’ll find OpenACC-standard.org’s OpenACC Programming and Best Practices Guide (June, 2015) here:
https://www.openacc.org/sites/default/files/inline-files/OpenACC_Programming_Guide_0.pdf.
The leading book for OpenACC is
Sunita Chandrasekaran and Guido Juckeland, OpenACC for Programmers: Concepts and Strategies (Addison-Wesley Professional, 2017).
OpenMP resources and references
Most of the books and guides to OpenMP predate device offloading capabilities, but the language specification thoroughly describes the OpenMP device offloading directives. At over 600 pages, it is more of a reference than a user’s guide. Still, it is the go-to document for details on the features of the language.2
OpenMP Architecture Review Board, OpenMP Application Programming Interface, Version 5.0 (November, 2018) at https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf.
A companion to the specification is the example guide. This guide gives short examples of how each feature should work, but not complete application-level cases:
OpenMP Architecture Review Board, OpenMP Application Programming Interface: Examples, Version 5.0 (November, 2019) at https://www.openmp.org/wp-content/uploads/openmp-examples-5.0.0.pdf.
With OpenMP still seeing significant changes and compilers still working on implementing v5.0 features, it is not surprising that there are few books that discuss the device offloading features. Ruud van der Pas and others recently completed a book that covers the new features of OpenMP up through v4.5.
Ruud van der Pas, Eric Stotzer, and Christian Terboven, Using OpenMP—The Next Step: Affinity, Accelerators, Tasking, and SIMD (MIT Press, 2017).
Find what compilers are available for your local GPU system. Are both OpenACC and OpenMP compilers available? If not, do you have access to any systems that would allow you to try out these pragma-based languages?
Run the stream triad examples from the OpenACC/StreamTriad and/or the OpenMP/StreamTriad directories on your local GPU development system. You’ll find these directories at https://github.com/EssentialsofParallelComputing/Chapter11.
Compare your results from exercise 2 to the BabelStream results at https://uob-hpc.github.io/BabelStream/results/. For the stream triad, the bytes moved are 3 * nsize * sizeof(datatype).
Modify the OpenMP data region mapping in listing 11.16 to reflect the actual use of the arrays in the kernels.
For x and y arrays of size 20,000,000, find the maximum radius for the arrays using both OpenMP and OpenACC. Initialize the arrays with double-precision values that linearly increase from 1.0 to 2.0e7 for the x array and decrease from 2.0e7 to 1.0 for the y array.
Pragma-based languages are the easiest way to port to the GPU. Using these gives you the quickest result with the least effort.
The porting process is to move work to the GPU and then manage the data movement. This gets as much work as possible on the GPU while minimizing expensive data movement.
The kernel optimization comes last and should mostly be left to the compiler. This produces the most portable and future-proof code.
Track the latest developments of the pragma-based language and compilers. These compilers are still under rapid development and should continue improving.
This chapter covers lower-level languages for GPUs. We call these native languages because they directly reflect features of the target GPU hardware. We cover two of these languages, CUDA and OpenCL, that are widely used. We also cover HIP, a new variant for AMD GPUs. In contrast to the pragma-based implementation, these GPU languages have a smaller reliance on the compiler. You should use these languages for more fine-tuned control of your program’s performance. How are these languages different than those presented in chapter 11? Our distinction is that these languages have grown up from the characteristics of the GPU and CPU hardware, while the OpenACC and OpenMP languages started with high-level abstractions and rely on a compiler to map those to different hardware.
The set of native GPU languages (CUDA, OpenCL, and HIP) requires a separate source to be created for the GPU kernel. The separate source code is often similar to the CPU code. Maintaining two different sources is a major difficulty. If a native GPU language only supports one type of hardware, there can be even more source variants to maintain if you want to run on more than one vendor’s GPU. Some applications have implemented their algorithms in multiple GPU languages as well as CPU languages. Thus, you can understand the critical need for more portable GPU programming languages.
Thankfully, portability is getting more attention with some of the newer GPU languages. OpenCL was the first open-standard language to run on a variety of GPU hardware and even CPUs. After an initial splash, OpenCL has not gotten as widespread an acceptance as originally hoped for. Another language, HIP, is designed by AMD as a more portable version of CUDA, which generates code for AMD’s GPUs. As part of AMD’s portability initiative, support for GPUs from other vendors is included.
The difference between these native languages and higher-level languages is blurring as new languages are introduced. The SYCL language, originally a C++ layer on top of OpenCL, is typical of these newer, more portable languages. Along with the Kokkos and RAJA languages, SYCL supports a single source for both CPU and GPU. We’ll touch on these languages at the end of the chapter. Figure 12.1 shows the current picture of the interoperability for the GPU languages that we cover in this chapter.
The focus on language interoperability is gaining traction as more diversity of GPUs appears in the largest HPC installations. The top Department of Energy HPC systems, Sierra and Summit, are provisioned with NVIDIA GPUs. In 2021, Argonne’s Aurora system with Intel GPUs and Oak Ridge’s Frontier system with AMD GPUs will be added to the list of Department of Energy HPC systems. With the introduction of the Aurora system, SYCL has emerged from near obscurity to become a major player with multiple implementations. SYCL was originally developed to provide a more natural C++ layer on top of OpenCL. The reason for the sudden emergence of SYCL was its adoption by Intel as part of the OneAPI programming model for Intel GPUs on the Aurora system. Because of SYCL’s new-found importance, we cover SYCL in section 12.4. A similar growth in interest in other languages and libraries that provide portability across the GPU landscape is also prevalent.
We end the chapter with a brief look at a couple of these performance portability systems, Kokkos and RAJA, that were created to ease the difficulty of running on a wide range of hardware, from CPUs to GPUs. These work at a slightly higher level of abstraction, but promise a single source that will run everywhere. Their development has resulted from a major Department of Energy effort to support the porting of large scientific applications to newer hardware. The aim of RAJA and Kokkos is a one-time rewrite to create a single-source code base that is portable and maintainable through a time of great change in hardware design.
Last, we want to provide guidance on how to approach this chapter. We cover a lot of different languages in a short space. The proliferation of languages reflects the lack of cooperation among language developers at this point in time, as developers chase their immediate goals and hardware concerns. Rather than treat these languages as different languages, think of them as slightly different dialects of one or two languages. We recommend that you seek to learn a couple of these languages and appreciate the differences and similarities with the others. We will be comparing and contrasting the languages to help you see that they are not all that different once you get over the particular syntax of each and their quirks. We do expect that the languages will merge to a more common form because the current situation is not sustainable. We already see the beginnings of that with the push for more language portability driven by the needs of large applications.
A GPU programming language must have several basic features. It is helpful to understand what these features are so that you can recognize these in each GPU language. We summarize the necessary GPU language features here.
Detecting the accelerator device—The language must provide a detection of the accelerator devices and a way to choose between those devices. Some languages give more control over the selection of devices than others. Even for a language such as CUDA, which just looks for an NVIDIA GPU, there must be a way to handle multiple GPUs on a node.
Support for writing device kernels—The language must provide a way to generate the low-level instructions for GPUs or other accelerators. GPUs provide nearly identical basic operations as CPUs, so the kernel language should not be dramatically different. Rather than invent a new language, the most straightforward way is to leverage current programming languages and compilers to generate the new instruction set. GPU languages have done this by adopting a particular version of the C or C++ language as a basis for their system. CUDA originally was based on the C programming language but now is based on C++ and has some support for the Standard Template Library (STL). OpenCL is based on the C99 standard and has released a new specification with C++ support.
The language design also needs to address whether to have the host and device source code in the same file or in different files. Either way, the compiler must distinguish between the host and device sources and must provide a way to generate the instruction set for the different hardware. The compiler must even decide when to generate the instruction set. For example, OpenCL waits for the device to be selected and then generates the instruction set with a just-in-time (JIT) compiler approach.
Mechanism to call device kernels from the host—OK, now we have the device code, but we also need a way of calling that code from the host. The syntax for performing this operation varies the most across the various languages. But the mechanism is only slightly more complicated than a standard subroutine call.
Memory handling—The language must have support for memory allocations, deallocations, and moving data back and forth from the host to the device. The most straightforward way for this is to have a subroutine call for each of these operations. But another way is through the compiler detecting when to move the data and doing it for you behind the scenes. As this is such a major part of GPU programming, innovation continues to occur on the hardware and software side for this functionality.
Synchronization—A mechanism must be provided to specify the synchronization requirements between the CPU and the GPU. Synchronization operations must also be provided within kernels.
Streams—A complete GPU language allows the scheduling of asynchronous streams of operations along with the explicit dependencies between the kernels and the memory transfer operations.
This is not such a scary list. For the most part, native GPU languages do not look so different from current CPU code. Also, recognizing these commonalities among native GPU language functionality helps you become comfortable moving from one language to another.
We will begin with a look at two of the low-level GPU languages, CUDA and HIP. These are two of the most common languages for programming GPUs.
Compute Unified Device Architecture (CUDA) is a proprietary language from NVIDIA that only runs on their GPUs. First released in 2008, it is currently the dominant native programming language for GPUs. With a decade of development, CUDA has a rich set of features and performance enhancements. The CUDA language closely reflects the architecture of the NVIDIA GPU. It does not purport to be a general accelerator language. Still, the concepts of most accelerators are similar enough for the CUDA language design to be applicable.
The AMD (formerly ATI) GPUs have had a series of short-lived programming languages. These have finally settled on a CUDA look-alike that can be generated by “HIPifying” CUDA code with the HIP compiler. This is part of the ROCm suite of tools that provides extensive portability between GPU languages, including the OpenCL language for GPUs (and CPUs) discussed in section 12.3.
We’ll start with how to build and compile a simple CUDA application that runs on a GPU. We’ll use the stream triad example we have used throughout the book, which implements a loop for this calculation: C = A + scalar * B. The CUDA compiler splits off the regular C++ code to pass to the underlying C++ compiler. It then compiles the remaining CUDA code. Code from these two paths is linked together into a single executable.
To follow along with this example, you might first need to install the CUDA software.1 Each release of CUDA works with a limited range of compiler versions. As of CUDA v10.2, GCC compilers up through v8 are supported. If you are working with multiple parallel languages and packages, this constantly-battling-the-compiler-version issue is perhaps one of the most frustrating things about CUDA. But on a positive note, you can use much of your regular toolchain and build systems with just the version constraints and a few special additions.
We’ll show three different approaches, starting with a simple makefile and then a couple of different ways of using CMake. We encourage you to follow along with the examples for this chapter at https://github.com/EssentialsofParallelComputing/Chapter12.
You can select this simple makefile for CUDA by copying or linking it to Makefile, the default filename for make. The following listing shows the makefile itself.
Listing 12.1 A simple CUDA makefile
CUDA/StreamTriad/Makefile.simple
 1 all: StreamTriad
 2
 3 NVCC = nvcc                                             ❶
 4 #NVCC_FLAGS = -arch=sm_30                               ❷
 5 #CUDA_LIB = <path>                                      ❷
 6 CUDA_LIB=`which nvcc | sed -e 's!/bin/nvcc!!'`/lib
 7 CUDA_LIB64=`which nvcc | sed -e 's!/bin/nvcc!!'`/lib64
 8
 9 %.o : %.cu                                              ❸
10    ${NVCC} ${NVCC_FLAGS} -c $< -o $@                    ❸
11
12 StreamTriad: StreamTriad.o timer.o
13    ${CXX} -o $@ $^ -L${CUDA_LIB} -lcudart               ❹
14
15 clean:
16    rm -rf StreamTriad *.o
❶ Specifies NVIDIA CUDA compiler
❷ You may need to set library path and GPU architecture type here.
❸ Implicit rule to compile CUDA source files
❹ Link line for CUDA applications
The key addition is a pattern rule on lines 9-10, which converts a file with a .cu suffix into an object file. We use the NVIDIA NVCC compiler for this operation. We then need to add the CUDA runtime library, CUDART, to the link line. You can use lines 4 and 5 to specify a particular NVIDIA GPU architecture and a special path to the CUDA libraries.
Definition A pattern rule is a specification to the make utility that provides a general rule on how to convert any file with one suffix pattern to a file with another suffix pattern.
CUDA has extensive support in the CMake build system. Next, we cover both the old-style support and the new modern CMake approach that’s recently emerged. We show the old-style method in listing 12.2. It has the advantage of more portability for systems with older CMake versions and the automatic detection of the NVIDIA GPU architecture. This latter feature of detecting the hardware device is such a convenience that the old-style CMake is the recommended approach at present. To use this build system, link the CMakeLists_old.txt to CMakeLists.txt:
ln -s CMakeLists_old.txt CMakeLists.txt
mkdir build && cd build
cmake ..
make
Listing 12.2 Old style CUDA CMake file
CUDA/StreamTriad/CMakeLists_old.txt
 1 cmake_minimum_required (VERSION 2.8)                    ❶
 2 project (StreamTriad)
 3
 4 find_package(CUDA REQUIRED)                             ❷
 5
 6 set (CMAKE_CXX_STANDARD 11)
 7 set (CMAKE_CUDA_STANDARD 11)
 8
 9 # sets CMAKE_{C,CXX}_FLAGS from CUDA compile flags.
   # Includes DEBUG and RELEASE
10 set (CUDA_PROPAGATE_HOST_FLAGS ON) # default is on
11 set (CUDA_SEPARABLE_COMPILATION ON)                     ❸
12
13 if (CMAKE_VERSION VERSION_GREATER "3.9.0")
14    cuda_select_nvcc_arch_flags(ARCH_FLAGS)              ❹
15 endif()
16
17 set (CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS}                 ❺
        -O3 ${ARCH_FLAGS})                                 ❺
18
19 # Adds build target of StreamTriad with source code files
20 cuda_add_executable(StreamTriad                         ❻
        StreamTriad.cu timer.c timer.h)                    ❻
21
22 if (APPLE)
23    set_property(TARGET StreamTriad PROPERTY BUILD_RPATH
         ${CMAKE_CUDA_IMPLICIT_LINK_DIRECTORIES})
24 endif (APPLE)
25
26 # Cleanup
27 add_custom_target(distclean COMMAND rm -rf CMakeCache.txt CMakeFiles
28    Makefile cmake_install.cmake StreamTriad.dSYM ipo_out.optrpt)
29
30 # Adds a make clean_cuda_depends target
   # -- invoke with "make clean_cuda_depends"
31 CUDA_BUILD_CLEAN_TARGET()
❶ You need a minimum of CMake v2.8 for CUDA support.
❷ Traditional CMake module sets compiler flags.
❸ Set to “on” for calling functions in other compile units (default off)
❹ Detects and sets proper architecture flag for current NVIDIA GPU
❺ Sets the compiler flags for the NVIDIA compiler
❻ Sets the proper build and link flags for a CUDA executable
Much of the CMake build system is standard. The separable compilation attribute on line 11 is suggested for a more robust build system for general development. You can then turn it off at a later stage to save a few registers in the CUDA kernels to get a small optimization in the generated code. The CUDA defaults are for performance, not for a more general, robust build. The automatic detection of the NVIDIA GPU architecture on line 14 is a significant convenience that keeps you from having to manually modify your makefile.
With version 3.0, CMake underwent a fairly major revision of its structure to what is called “modern” CMake. The key attributes of this style are a more integrated system and a per-target application of attributes. Nowhere is this more apparent than in its support for CUDA. Let’s take a look at listing 12.3 to see how to use it. To use this build system with the modern, new-style CMake support for CUDA, link CMakeLists_new.txt to CMakeLists.txt:
ln -s CMakeLists_new.txt CMakeLists.txt
mkdir build && cd build
cmake ..
make
Listing 12.3 New style (modern) CUDA CMake file
CUDA/StreamTriad/CMakeLists_new.txt
 1 cmake_minimum_required (VERSION 3.8)                    ❶
 2 project (StreamTriad)
 3
 4 enable_language(CXX CUDA)                               ❷
 5
 6 set (CMAKE_CXX_STANDARD 11)
 7 set (CMAKE_CUDA_STANDARD 11)
 8
 9 #set (ARCH_FLAGS -arch=sm_30)                           ❸
10 set (CMAKE_CUDA_FLAGS ${CMAKE_CUDA_FLAGS};              ❹
        "-O3 ${ARCH_FLAGS}")                               ❹
11
12 # Adds build target of StreamTriad with source code files
13 add_executable(StreamTriad StreamTriad.cu timer.c timer.h)
14
15 set_target_properties(StreamTriad PROPERTIES            ❺
        CUDA_SEPARABLE_COMPILATION ON)                     ❺
16
17 if (APPLE)
18    set_property(TARGET StreamTriad PROPERTY BUILD_RPATH
         ${CMAKE_CUDA_IMPLICIT_LINK_DIRECTORIES})
19 endif(APPLE)
20
21 # Cleanup
22 add_custom_target(distclean COMMAND rm -rf CMakeCache.txt CMakeFiles
23    Makefile cmake_install.cmake StreamTriad.dSYM ipo_out.optrpt)
❶ You need a minimum of CMake v3.8 for CUDA language support.
❷ Enables CUDA as a language for the build
❶ Requires CMake version 3.8 or later for native CUDA support
❷ Enables CUDA as a first-class language
❸ Manually sets the CUDA architecture
❹ Sets the compile flags for CUDA
❺ Sets the separable compilation flag
The first thing to note about this modern CMake approach is how much simpler it is than the old style. The key is enabling CUDA as a language on line 4. From then on, little additional work needs to be done.
We can set flags to compile for a specific GPU architecture, as shown on lines 9-10. However, we don't yet have an automatic way to detect the architecture with the modern CMake style. Without an architecture flag, the compiler generates code and optimizes for the sm_30 GPU device. The sm_30 generated code runs on any device from the Kepler K40 onward, but it will not be optimized for the latest architectures. You can also specify multiple architectures in one compile; compiles will be slower, and the generated executable will be larger.
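For example, the commented-out ARCH_FLAGS line in listing 12.3 could be extended to embed code for several architectures at once. The specific compute capabilities below (sm_60 for Pascal, sm_70 for Volta) are illustrative choices, not from the original listing; this is a sketch of the -gencode technique, not the book's build file:

```cmake
# Hypothetical replacement for line 9 of listing 12.3: embeds native code for
# two architectures plus PTX for forward compatibility with newer GPUs.
set (ARCH_FLAGS -gencode arch=compute_60,code=sm_60
                -gencode arch=compute_70,code=sm_70
                -gencode arch=compute_70,code=compute_70)
```

Each extra -gencode pair adds another compilation pass and another copy of the device code in the fat binary, which is why the compile slows down and the executable grows.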
We can also set the separable compilation attribute for CUDA, but with a different syntax that applies it to a specific target. The optimization flag on line 10, -O3, is only sent to the host compiler for the regular C++ code. The default optimization level for CUDA code is -O3 and seldom needs to be modified.
Overall, the process of building a CUDA program is easy and getting easier. Expect changes to the build process to continue, however. Clang is adding native support for compiling CUDA code, giving you another option besides the NVIDIA compiler. Now let's move on to the source code. We'll begin with the kernel for the GPU in the following listing.
Listing 12.4 CUDA version of stream triad: The kernel
CUDA/StreamTriad/StreamTriad.cu
2 __global__ void StreamTriad(
3 const int n,
4 const double scalar,
5 const double *a,
6 const double *b,
7 double *c)
8 {
9 int i = blockIdx.x*blockDim.x+threadIdx.x; ❶
10
11 // Protect from going out-of-bounds
12 if (i >= n) return; ❷
13
14 c[i] = a[i] + scalar*b[i]; ❸
15 }
❶ Calculates the global index from the block and thread variables
❷ Protects from going out-of-bounds
❸ The stream triad loop body
As is typical with GPU kernels, we strip the for loop from the computational block. This leaves the loop body on line 14. We need to add the conditional on line 12 to prevent accessing out-of-bounds data. Without this protection, kernels can crash randomly without a message. Then, on line 9, we get the global index from the block and thread variables set by the CUDA runtime. Adding the __global__ attribute to the subroutine tells the compiler that this is a GPU kernel that will be called from the host. Meanwhile, on the host side, we have to set up the memory and make the kernel call. The following listing shows this process.
Listing 12.5 CUDA version of stream triad: Set up and tear down
CUDA/StreamTriad/StreamTriad.cu
31    // allocate host memory and initialize
32    double *a = (double *)malloc(stream_array_size*sizeof(double));   ❶
33    double *b = (double *)malloc(stream_array_size*sizeof(double));   ❶
34    double *c = (double *)malloc(stream_array_size*sizeof(double));   ❶
35
36    for (int i=0; i<stream_array_size; i++) {
37       a[i] = 1.0;   ❷
38       b[i] = 2.0;   ❷
39    }
40
41    // allocate device memory. suffix of _d indicates a device pointer
42    double *a_d, *b_d, *c_d;
43    cudaMalloc(&a_d, stream_array_size*sizeof(double));   ❸
44    cudaMalloc(&b_d, stream_array_size*sizeof(double));   ❸
45    cudaMalloc(&c_d, stream_array_size*sizeof(double));   ❸
46
47    // setting block size and padding total grid size to get even block sizes
48    int blocksize = 512;   ❹
49    int gridsize = (stream_array_size + blocksize - 1)/blocksize;   ❹
50
   < ... timing loop ... code shown below in listing 12.6 >
78    printf("Average runtime is %lf msecs data transfer is %lf msecs\n",
79       tkernel_sum/NTIMES, (ttotal_sum - tkernel_sum)/NTIMES);
80
81    cudaFree(a_d);   ❺
82    cudaFree(b_d);   ❺
83    cudaFree(c_d);   ❺
84
85    free(a);   ❻
86    free(b);   ❻
87    free(c);   ❻
88 }
❶ Allocates host memory
❷ Initializes host arrays
❸ Allocates device memory
❹ Sets block size and calculates number of blocks
❺ Frees device memory
❻ Frees host memory
First, we allocate memory on the host and initialize it on lines 31-39. We also need a corresponding memory space on the GPU to hold the arrays while the GPU operates on them. For that, we use the cudaMalloc routine on lines 43-45. Now we come to some interesting lines (47-49) that are needed solely for the GPU. The block size is the size of the workgroup on the GPU. This is known as the tile size, block size, or workgroup size, depending on the GPU programming language being used (see table 10.1). The next line, which calculates the grid size, is characteristic of GPU code. We won't always have an array size that is an even integer multiple of the block size, so we need an integer number of blocks equal to or greater than the fractional number of blocks. Let's work through an example to understand what is being done.
With a 1,000-element array, all the blocks but the last one have 512 values. The last block will still be size 512, but will contain only 488 data items. The out-of-bounds check on line 12 of listing 12.4 keeps us from getting into trouble with this partially filled block. The last few lines in listing 12.5 free the device pointers and the host pointers. You must be careful to use cudaFree for the device pointers and the C library function free for the host pointers.
All we have left is to copy memory to the GPU, call the GPU kernel, and copy the memory back. We do this in a timing loop (in listing 12.6) that can execute multiple times to get a better measurement. Sometimes the first call to a GPU is slower due to initialization costs. We can amortize this by running several iterations. If that is not sufficient, you can also throw away the timing from the first iteration.
Listing 12.6 CUDA version of stream triad: Kernel call and timing loop
CUDA/StreamTriad/StreamTriad.cu
51 for (int k=0; k<NTIMES; k++){
52 cpu_timer_start(&ttotal);
53 cudaMemcpy(a_d, a, stream_array_size* ❶
sizeof(double), cudaMemcpyHostToDevice); ❶
54 cudaMemcpy(b_d, b, stream_array_size* ❶
sizeof(double), cudaMemcpyHostToDevice); ❶
55 // cuda memcopy to device returns after buffer available
56 cudaDeviceSynchronize(); ❷
57
58 cpu_timer_start(&tkernel);
59 StreamTriad<<<gridsize, blocksize>>> ❸
(stream_array_size, scalar, a_d, b_d, c_d); ❸
60 cudaDeviceSynchronize(); ❹
61 tkernel_sum += cpu_timer_stop(tkernel);
62
63 // cuda memcpy from device to host blocks for completion
// so no need for synchronize
64 cudaMemcpy(c, c_d, stream_array_size* ❺
sizeof(double), cudaMemcpyDeviceToHost); ❺
65 ttotal_sum += cpu_timer_stop(ttotal);
66 // check results and print errors if found.
// limit to only 10 errors per iteration
67 for (int i=0, icount=0; i<stream_array_size && icount < 10; i++){
68 if (c[i] != 1.0 + 3.0*2.0) {
69 printf("Error with result c[%d]=%lf on iter %d\n",i,c[i],k);
70 icount++;
71 } // if not correct, print error
72 } // result checking loop
73 } // timing for loop
❶ Copies array data from host to device
❷ Synchronizes to get accurate timing for kernel only
❸ Launches the stream triad kernel
❹ Forces completion to get timing
❺ Copies array data back from device to host
The pattern in the timing loop is composed of the following steps:

1. Copy the arrays from the host to the device (lines 53-54).
2. Call the GPU kernel (line 59).
3. Copy the resulting array back from the device to the host (line 64).
We add some synchronization and timer calls to get an accurate measurement of the GPU kernel. At the end of the loop, we check the correctness of the result. Once this goes into production, we can remove the timing, synchronization, and the error check. The call to the GPU kernel can easily be spotted by the triple chevrons, or angle brackets. If we ignore the chevrons and the variables contained within them, the line has the syntax of a typical C subroutine call:
StreamTriad(stream_array_size, scalar, a_d, b_d, c_d);
The values within the parentheses are the arguments to be passed to the GPU kernel. Preceding them is the launch configuration within chevrons, for example
<<<gridsize, blocksize>>>
So what are the arguments contained within the chevrons? These are the parameters that tell the CUDA runtime how to break up the problem into blocks for the GPU. Earlier, on lines 48-49 of listing 12.5, we set the block size and calculated the number of blocks, or grid size, to contain all the data in the array. The arguments in this case are 1D. We can also have 2D or 3D decompositions by declaring and setting these arguments, as follows for an NxN matrix.
dim3 blocksize(16,16);                           // 2D: 16 x 16 threads per block
dim3 gridsize( (N + blocksize.x - 1)/blocksize.x,
               (N + blocksize.y - 1)/blocksize.y );

dim3 blocksize(8,8,8);                           // or 3D: 8 x 8 x 8 threads per block
We can speed up the memory transfers by eliminating a data copy. This is possible through a deeper understanding of how the operating system functions. Memory being transferred must be in a fixed location that cannot be moved during the operation. Normal memory allocations are placed into pageable memory, or memory that can be moved on demand. A memory transfer must therefore first move the data into pinned memory, memory that cannot be moved. We first saw the use of pinned memory in section 9.4.2 when benchmarking memory movement over the PCI bus. We can eliminate a memory copy by allocating our arrays in pinned memory rather than pageable memory. Figure 9.8 shows the difference in performance that we might obtain. Now, how do we make this happen?
CUDA gives us a function call, cudaMallocHost, that does this for us. It is a straight-up replacement for the regular system malloc routine, with a slight change in the arguments: the pointer is returned through an argument, as shown:
double *x_host = (double *)malloc(stream_array_size*sizeof(double));   // pageable
cudaMallocHost((void**)&x_host, stream_array_size*sizeof(double));     // pinned
Is there a downside to using pinned memory? Well, if you use a lot of pinned memory, there is no room left to swap in another application. Swapping out the memory for one application and bringing in another is a huge convenience for users. This process is called memory paging.
Definition Memory paging in multi-user, multi-application operating systems is the process of temporarily moving memory pages out to disk so that another process can use the physical memory.
Memory paging is an important advance in operating systems that makes it seem like you have more memory than you really do. For example, it allows you to temporarily start up Excel while working in Word without having to close the original application. It does this by writing your data out to disk and then reading it back when you return to Word. But this operation is expensive, so in high-performance computing we avoid memory paging because of the severe performance penalty it incurs. Some heterogeneous computing systems with both a CPU and a GPU are implementing unified memory.
Definition Unified memory is memory that has the appearance of being a single address space for both the CPU and the GPU.
By now, you have seen that handling separate memory spaces on the CPU and the GPU introduces much of the complexity of writing GPU code. With unified memory, the GPU runtime system handles this for you. There may still be two separate arrays, but the data is moved automatically. On integrated GPUs, there is the possibility that memory does not have to be moved at all. Still, it is advisable to write your programs with explicit memory copies so that they are portable to systems without unified memory. The memory copy is simply skipped if it is not needed on the architecture.
When we need cooperation among GPU threads, things get complicated with lower-level, native GPU languages. We'll look at a simple summation example to see how to deal with this. The example requires two separate CUDA kernels and is shown in listings 12.7-12.10. The following listing shows the first pass, where we sum the values within a thread block and store the result out to the reduction scratch array, redscratch.
Listing 12.7 First pass of a sum reduction operation
CUDA/SumReduction/SumReduction.cu (four parts)
23 __global__ void reduce_sum_stage1of2(
24 const int isize, // 0 Total number of cells.
25 double *array, // 1
26 double *blocksum, // 2
27 double *redscratch) // 3
28 {
29 extern __shared__ double spad[]; ❶
30 const unsigned int giX = blockIdx.x*blockDim.x+threadIdx.x;
31 const unsigned int tiX = threadIdx.x;
32
33 const unsigned int group_id = blockIdx.x;
34
35 spad[tiX] = 0.0; ❷
36 if (giX < isize) { ❷
37 spad[tiX] = array[giX]; ❷
38 } ❷
39
40 __syncthreads(); ❸
41
42 reduction_sum_within_block(spad); ❹
43
44 // Write the local value back to an array
// the size of the number of groups
45 if (tiX == 0){ ❺
46 redscratch[group_id] = spad[0]; ❺
47 (*blocksum) = spad[0];
48 }
49 }
❶ Scratchpad array in CUDA shared memory
❷ Loads memory into scratchpad array
❸ Synchronizes threads before using scratchpad data
❹ Performs the reduction within the thread block
❺ One thread stores result for block.
We start the first pass by having all of the threads store their data into a scratchpad array in CUDA shared memory (lines 35-38). All the threads in the block can access this shared memory. Shared memory can be accessed in one or two processor cycles instead of the hundreds required for main GPU memory. You can think of shared memory as a programmable cache or as scratchpad memory. To make sure all the threads have completed the store, we use a synchronization call on line 40.
Because the reduction sum within the block will be used in both reduction passes, we put the code in a device subroutine and call it on line 42. A device subroutine is one that is called from another device routine rather than from the host. After the subroutine, the resulting sum is stored back out into a smaller scratch array that we read in during the second phase. We also store the result on line 47 in case the second pass can be skipped. Because we cannot access the values in other thread blocks, we have to complete the operation in a second pass in another kernel. In this first pass, we have reduced the length of the data by a factor of the block size.
Let's move on to the common device code that we mentioned in the first pass. We need a sum reduction for the CUDA thread block in both passes, so we write it as a general device routine. The code shown in the following listing can easily be modified for other reduction operators and needs only small changes for HIP and OpenCL.
Listing 12.8 Common sum reduction device kernel
CUDA/SumReduction/SumReduction.cu (four parts)
 1 #define MIN_REDUCE_SYNC_SIZE warpSize   ❶
 2
 3 __device__ void reduction_sum_within_block(double *spad)
 4 {
 5    const unsigned int tiX = threadIdx.x;
 6    const unsigned int ntX = blockDim.x;
 7
 8    for (int offset = ntX >> 1; offset > MIN_REDUCE_SYNC_SIZE; offset >>= 1) {
 9       if (tiX < offset) {   ❷
10          spad[tiX] = spad[tiX] + spad[tiX+offset];
11       }
12       __syncthreads();   ❸
13    }
14    if (tiX < MIN_REDUCE_SYNC_SIZE) {
15       for (int offset = MIN_REDUCE_SYNC_SIZE; offset > 1; offset >>= 1) {
16          spad[tiX] = spad[tiX] + spad[tiX+offset];
17          __syncthreads();   ❸
18       }
19       spad[tiX] = spad[tiX] + spad[tiX+1];
20    }
21 }
❶ CUDA defines warpSize to be 32
❷ Only uses the threads needed when the working set is greater than the warp size
❸ Synchronizes between every level of the pass
The common device routine that will be called from both passes is defined on line 3. It performs a sum reduction within the thread block. The __device__ attribute before the routine indicates that it will be called from a GPU kernel. The basic concept of the routine is a pair-wise reduction tree with O(log n) operations, as figure 12.2 shows. The basic reduction tree from the figure is represented by the code on lines 15-18. We make some minor modifications to avoid unnecessary synchronization: when the working set is larger than the warp size, on lines 8-13, and for the final level of the pass, on line 19.
The same pair-wise reduction concept is used for the full thread block, which can be up to 1,024 threads on most GPU devices, though 128 to 256 is more commonly used. But what do you do if your array size is greater than the block size? We add a second pass that uses just a single thread block, as the following listing shows.
Listing 12.9 Second pass for reduction operation
CUDA/SumReduction/SumReduction.cu (four parts)
51 __global__ void reduce_sum_stage2of2(
52 const int isize,
53 double *total_sum,
54 double *redscratch)
55 {
56 extern __shared__ double spad[];
57 const unsigned int tiX = threadIdx.x;
58 const unsigned int ntX = blockDim.x;
59
60 int giX = tiX;
61
62 spad[tiX] = 0.0;
63
64 // load the sum from reduction scratch, redscratch
65 if (tiX < isize) spad[tiX] = redscratch[giX]; ❶
66
67 for (giX += ntX; giX < isize; giX += ntX) { ❷
68 spad[tiX] += redscratch[giX]; ❷
69 } ❷
70
71 __syncthreads(); ❸
72
73 reduction_sum_within_block(spad); ❹
74
75 if (tiX == 0) {
76 (*total_sum) = spad[0]; ❺
77 }
78 }
❶ Loads values into scratchpad array
❷ Loops by thread block-size increments to get all the data
❸ Synchronizes when scratchpad array is filled
❹ Calls our common block reduction routine
❺ One thread sets the total sum for return.
To avoid needing more than two kernels for larger arrays, we use one thread block and loop on lines 67-69 to read and sum any additional data into the shared scratchpad. We use a single thread block because we can synchronize within it, avoiding the need for another kernel call. If we use a thread block size of 128 and have a one-million element array, the loop sums about 60 values into each location in shared memory (1,000,000/128² ≈ 61). The array size is reduced by a factor of 128 in the first pass, and then we sum into a scratchpad of size 128, giving us the division by 128 squared. If we used a larger block size, such as 1,024, we could reduce the loop from 60 iterations to a single read. Now we just call the same common thread block reduction that we used before. The result will be the first value in the scratchpad array. The last part is to set up and call these two kernels from the host. We'll see how this is done in the following listing.
Listing 12.10 Host code for CUDA reduction
CUDA/SumReduction/SumReduction.cu (four parts)
100    size_t blocksize = 128;   ❶
101    size_t blocksizebytes = blocksize*sizeof(double);   ❶
102    size_t global_work_size = ((nsize + blocksize - 1) /blocksize) * blocksize;
103    size_t gridsize = global_work_size/blocksize;   ❶
104
105    double *dev_x, *dev_total_sum, *dev_redscratch;
106    cudaMalloc(&dev_x, nsize*sizeof(double));   ❷
107    cudaMalloc(&dev_total_sum, 1*sizeof(double));   ❷
108    cudaMalloc(&dev_redscratch, gridsize*sizeof(double));   ❷
109
110    cudaMemcpy(dev_x, x, nsize*sizeof(double), cudaMemcpyHostToDevice);   ❸
111
112    reduce_sum_stage1of2<<<gridsize, blocksize, blocksizebytes>>>   ❹
            (nsize, dev_x, dev_total_sum, dev_redscratch);   ❹
113
114    if (gridsize > 1) {
115       reduce_sum_stage2of2<<<1, blocksize, blocksizebytes>>>   ❺
               (nsize, dev_total_sum, dev_redscratch);   ❺
116    }
117
118    double total_sum;
119    cudaMemcpy(&total_sum, dev_total_sum, 1*sizeof(double), cudaMemcpyDeviceToHost);
120    printf("Result -- total sum %lf \n",total_sum);
121
122    cudaFree(dev_redscratch);
123    cudaFree(dev_total_sum);
124    cudaFree(dev_x);
❶ Calculates the block and grid sizes for the CUDA kernels
❷ Allocates device memory for the kernel
❸ Copies the array to the GPU device
❹ Calls the first pass of the reduction kernel
❺ If needed, calls the second pass
The host code first calculates the sizes for the kernel calls on lines 100-103. We then have to allocate the memory for the device arrays. For this operation, we need a scratch array where we can store the sums for each block from the first kernel. We allocate it on line 108 to be the grid size, because that is the number of blocks that we have. We also need a shared memory scratchpad array whose size is the block size in bytes. We calculate this size on line 101 and pass it into the kernels on lines 112 and 115 as the third parameter of the chevron operator. The third parameter is optional; this is the first time we have seen it used. Look back at listing 12.7 (line 29) and listing 12.9 (line 56) to see where the corresponding scratchpad declaration is handled on the GPU device.
Trying to follow all the convoluted loops can be difficult, so we have created a version of the code that performs the same loops on the CPU and prints its values as it goes along. It is in the CUDA/SumReductionRevealed directory at
https://github.com/EssentialsofParallelComputing/Chapter12
We don't have room to show all the code here, but you might find it useful to explore the code and watch the values it prints as it executes. We show an edited version of the output in the following example.
We have shown this thread block reduction as a general introduction to kernels that require thread cooperation. You can see how complicated this is, especially compared to the single line needed for the intrinsic call in Fortran. In the process, we also gained a lot of speedup over the CPU and kept the data on the GPU for this operation. This algorithm can be optimized further, but you can also consider using library services such as CUDA UnBound (CUB), Thrust, or other GPU libraries.
CUDA code only runs on NVIDIA GPUs, but AMD has implemented a similar GPU language and named it the Heterogeneous-compute Interface for Portability (HIP). It is part of the Radeon Open Compute platform (ROCm) suite of tools from AMD. If you program in the HIP language, you can call the hipcc compiler, which uses NVCC on NVIDIA platforms and HCC on AMD GPUs.
To try these examples, you may need to install the ROCm suite of software and tools. The install process frequently changes, so check for the latest instructions. There are some instructions that accompany the examples as well.
There is also good support for HIP in CMake; the HIP CMake modules require at least version 2.8.3 of CMake. A typical CMakeLists file for HIP is shown in the following listing.
Listing 12.11 Building a HIP program with CMake
HIP/StreamTriad/CMakeLists.txt
 1 cmake_minimum_required (VERSION 2.8.3)                      ❶
 2 project (StreamTriad)
 3
 6 if(NOT DEFINED HIP_PATH)                                    ❷
 7    if(NOT DEFINED ENV{HIP_PATH})
 8       set(HIP_PATH "/opt/rocm/hip" CACHE PATH "Path to HIP install")
 9    else()
10       set(HIP_PATH $ENV{HIP_PATH} CACHE PATH "Path to HIP install")
11    endif()
12 endif()
13 set(CMAKE_MODULE_PATH "${HIP_PATH}/cmake" ${CMAKE_MODULE_PATH})
14
15 find_package(HIP REQUIRED)                                  ❸
16 if(HIP_FOUND)
17    message(STATUS "Found HIP: " ${HIP_VERSION})
20 endif()
21
22 set(CMAKE_CXX_COMPILER ${HIP_HIPCC_EXECUTABLE})             ❹
23 set(MY_HIPCC_OPTIONS )
24 set(MY_HCC_OPTIONS )
25 set(MY_NVCC_OPTIONS )
26
27 # Adds build target of StreamTriad with source code files
28 HIP_ADD_EXECUTABLE(StreamTriad StreamTriad.cc               ❺
      timer.c timer.h)                                         ❺
29 target_include_directories(StreamTriad PRIVATE ${HIP_PATH}/include)
30 target_link_directories(StreamTriad PRIVATE ${HIP_PATH}/lib)
31 target_link_libraries(StreamTriad hip_hcc)
32
33 # Cleanup
34 add_custom_target(distclean COMMAND rm -rf CMakeCache.txt CMakeFiles *.o
35    Makefile cmake_install.cmake StreamTriad.dSYM ipo_out.optrpt)
❶ Minimum version of CMake for HIP is 2.8.3
❷ Sets a path to the HIP installation
❸ Finds the HIP package
❹ Sets the C++ compiler to hipcc
❺ Adds the executable, includes, and libraries
In the listing, we first try to set different path options for where the HIP install might be located and then call find_package for HIP on line 15. We then set the C++ compiler to hipcc on line 22. The HIP_ADD_EXECUTABLE command adds the build of our executable, and we round out the listing with settings for the HIP header files and libraries (lines 28-31). Now let’s turn our attention to the HIP source in listing 12.12. We highlight the changes from the CUDA version of the source code given in listings 12.5-12.6.
Listing 12.12 The HIP differences for the stream triad
HIP/StreamTriad/StreamTriad.c
 1 #include "hip/hip_runtime.h"                                ❶
< . . . skipping . . . >
36    // allocate device memory. suffix of _d indicates a device pointer
37    double *a_d, *b_d, *c_d;
38    hipMalloc(&a_d, stream_array_size*sizeof(double));       ❷
39    hipMalloc(&b_d, stream_array_size*sizeof(double));       ❷
40    hipMalloc(&c_d, stream_array_size*sizeof(double));       ❷
< . . . skipping . . . >
46    for (int k=0; k<NTIMES; k++){
47       cpu_timer_start(&ttotal);
48       // copying array data from host to device
49       hipMemcpy(a_d, a, stream_array_size*sizeof(double), hipMemcpyHostToDevice); ❸
50       hipMemcpy(b_d, b, stream_array_size*sizeof(double), hipMemcpyHostToDevice); ❸
51       // cuda memcopy to device returns after buffer available,
52       // so synchronize to get accurate timing for kernel only
53       hipDeviceSynchronize();                               ❹
54
55       cpu_timer_start(&tkernel);
56       // launch stream triad kernel
57       hipLaunchKernelGGL(StreamTriad,                       ❺
            dim3(gridsize), dim3(blocksize), 0, 0,             ❺
            stream_array_size, scalar, a_d, b_d, c_d);         ❺
58       // need to force completion to get timing
59       hipDeviceSynchronize();                               ❹
60       tkernel_sum += cpu_timer_stop(tkernel);
61
62       // cuda memcpy from device to host blocks for completion
         // so no need for synchronize
63       hipMemcpy(c, c_d, stream_array_size*sizeof(double), hipMemcpyDeviceToHost); ❸
< . . . skipping . . . >
72    }
< . . . skipping . . . >
75
76    hipFree(a_d);                                            ❻
77    hipFree(b_d);                                            ❻
78    hipFree(c_d);                                            ❻
❶ We need to include the HIP run-time header.
❷ cudaMalloc becomes hipMalloc.
❸ cudaMemcpy becomes hipMemcpy.
❹ cudaDeviceSynchronize becomes hipDeviceSynchronize.
❺ hipLaunchKernelGGL is a more traditional syntax than the CUDA kernel launch.
❻ cudaFree becomes hipFree.
To convert from CUDA source to HIP source, we replace all occurrences of cuda in the source with hip. The only more significant change is to the kernel launch call, where HIP uses a more traditional syntax than the triple chevron used in CUDA. Oddly enough, the biggest effort is in using the correct terminology in the variable naming for the two languages.
With the overwhelming need for portable GPU code, a new GPU programming language, OpenCL, emerged in 2008. OpenCL is an open standard GPU language that can run on both NVIDIA and AMD/ATI graphics cards, as well as many other hardware devices. The OpenCL standard effort was led by Apple with many other organizations involved. One of the nice things about OpenCL is that you can use virtually any C or even C++ compiler for the host code. For the GPU device code, OpenCL was initially based on a subset of C99. More recently, the 2.1 and 2.2 versions of OpenCL added C++14 support, but implementations are still not available.
The OpenCL release took off with a lot of initial excitement. Finally, here was a way to write portable GPU code. For example, GIMP announced that it would support OpenCL as a way for GPU acceleration to be made available on many hardware platforms. The reality has been less compelling. Many feel that OpenCL is too low-level and verbose for widespread acceptance. It may even be that its eventual role is as the low-level portability layer for higher level languages. But its value as a portable language across a diverse set of hardware devices has been demonstrated by its acceptance within the embedded device community for field-programmable gate arrays (FPGAs). One of the reasons OpenCL is thought to be verbose is that the device selection is more complicated (and powerful). You have to detect and select the device you will run on. This can amount to a hundred lines of code just to get started.
Nearly everyone who uses OpenCL writes a library to handle the low-level concerns. We are no exception. Our library is called EZCL. Nearly every OpenCL call is wrapped with at least a light layer to handle the error conditions. Device detection, compiling code, and error handling consume a lot of lines of code.
We’ll use an abbreviated version of our EZCL library, called EZCL_Lite, in our examples so that you can see the actual OpenCL calls. The EZCL_Lite routines are used to select the device and set it up for the application, then compile the device code and handle the errors. The code for these operations is too long to show here, so look at the examples in the OpenCL directory at https://github.com/EssentialsofParallelComputing/Chapter12. The full EZCL library is also available in the directory. The EZCL routines give detailed errors for calls, including the line in the source code where the error occurs.
Before you start out trying the OpenCL code, check to see if you have the proper setup and devices. For this, you can use the clinfo command.
The changes to a standard makefile to incorporate OpenCL are not too complicated. The typical changes are shown in listing 12.13. To use the simple makefile for OpenCL, type

ln -s Makefile.simple Makefile
Then build the application with make and run the application with ./StreamTriad.
Listing 12.13 OpenCL simple makefile
OpenCL/StreamTriad/Makefile.simple
 1 all: StreamTriad
 2
 3 #CFLAGS = -DDEVICE_DETECT_DEBUG=1            ❶
 4 #OPENCL_LIB = -L<path>
 5
 6 %.inc : %.cl                                 ❷
 7 	./embed_source.pl $^ > $@                   ❷
 8
 9 StreamTriad.o: StreamTriad.c StreamTriad_kernel.inc
10
11 StreamTriad: StreamTriad.o timer.o ezclsmall.o
12 	${CC} -o $@ $^ ${OPENCL_LIB} -lOpenCL
13
14 clean:
15 	rm -rf StreamTriad *.o StreamTriad_kernel.inc
❶ Turns on device detection verbosity
❷ Pattern rule embeds the OpenCL source
The makefile includes a way to set the DEVICE_DETECT_DEBUG flag to print out detailed information on the available GPU devices. This flag turns on more verbosity in the ezcl_lite.c source code. It can be helpful for fixing problems with device detection or with getting the wrong device. There is also the addition of a pattern rule on line 6 that embeds the OpenCL source into the program for use at run time. The embed_source.pl script converts the source into a C character string, and the generated file appears as a dependency on line 9. It is included in the StreamTriad.c file with an include statement.
The embed_source.pl utility is one that we developed to link the OpenCL source directly into the executable. (See the chapter examples for the source to this utility.) The common way for OpenCL code to function is to have a separate source file that must be located at run time, which is then compiled once the device is known. Using a separate file creates problems with it not being able to be found or getting the wrong version of the file. We strongly recommend embedding the source into the executable to avoid these problems. We can also use CMake support for OpenCL in our build system as the following listing shows.
Listing 12.14 OpenCL CMake file
OpenCL/StreamTriad/CMakeLists.txt
 1 cmake_minimum_required (VERSION 3.1)                        ❶
 2 project (StreamTriad)
 3
 4 if (DEVICE_DETECT_DEBUG)                                    ❷
 5    add_definitions(-DDEVICE_DETECT_DEBUG=1)                 ❷
 6 endif (DEVICE_DETECT_DEBUG)                                 ❷
 7
 8 find_package(OpenCL REQUIRED)                               ❶
 9 set(HAVE_CL_DOUBLE ON CACHE BOOL "Have OpenCL Double")      ❸
10 set(NO_CL_DOUBLE OFF)                                       ❸
11 include_directories(${OpenCL_INCLUDE_DIRS})
12
13 # Adds build target of StreamTriad with source code files
14 add_executable(StreamTriad StreamTriad.c ezclsmall.c ezclsmall.h timer.c timer.h)
15 target_link_libraries(StreamTriad ${OpenCL_LIBRARIES})
16 add_dependencies(StreamTriad StreamTriad_kernel_source)
17
18 ########### embed source target ##############              ❹
19 add_custom_command(OUTPUT                                   ❹
      ${CMAKE_CURRENT_BINARY_DIR}/StreamTriad_kernel.inc       ❹
20    COMMAND ${CMAKE_SOURCE_DIR}/embed_source.pl              ❹
      ${CMAKE_SOURCE_DIR}/StreamTriad_kernel.cl                ❹
      > StreamTriad_kernel.inc                                 ❹
21    DEPENDS StreamTriad_kernel.cl ${CMAKE_SOURCE_DIR}/embed_source.pl) ❹
22 add_custom_target(StreamTriad_kernel_source ALL DEPENDS     ❹
      ${CMAKE_CURRENT_BINARY_DIR}/StreamTriad_kernel.inc)      ❹
23
24 # Cleanup
25 add_custom_target(distclean COMMAND rm -rf CMakeCache.txt CMakeFiles
26    Makefile cmake_install.cmake StreamTriad.dSYM ipo_out.optrpt)
27
28 SET_DIRECTORY_PROPERTIES(PROPERTIES ADDITIONAL_MAKE_CLEAN_FILES "StreamTriad_kernel.inc")
❶ CMake added OpenCL support with version 3.1.
❷ Turns on device detection verbosity
❸ Provides a way to turn OpenCL double-precision support on and off
❹ Custom command embeds OpenCL source into executable
OpenCL support in CMake was added at version 3.1. We added this version requirement at the top of the CMakeLists.txt file on line 1. There are a few other special things to note. For this example, we used the -DDEVICE_DETECT_DEBUG=1 option to the CMake command to turn on the verbosity for the device detection. Also, we included a way to turn support for OpenCL double precision on and off. We use this in the EZCL_Lite code to set the just-in-time (JIT) compile flag for the OpenCL device code. Last, we added a custom command on lines 19-22 for embedding the OpenCL device source into the executable. The source code for the OpenCL kernel is in a separate file called StreamTriad_kernel.cl, as shown in the following listing.
Listing 12.15 OpenCL kernel source for the stream triad

OpenCL/StreamTriad/StreamTriad_kernel.cl
 1 // OpenCL kernel version of stream triad
 2 __kernel void StreamTriad(                   ❶
 3          const int n,
 4          const double scalar,
 5          __global const double *a,
 6          __global const double *b,
 7          __global double *c)
 8 {
 9    int i = get_global_id(0);                 ❷
10
11    // Protect from going out-of-bounds
12    if (i >= n) return;
13
14    c[i] = a[i] + scalar*b[i];
15 }
❶ __kernel attribute indicates this is called from the host.
❷ get_global_id gets the global thread index.
Compare this kernel code to the kernel code for CUDA in listing 12.4. The OpenCL code is nearly identical except that __kernel replaces __global__ on the subroutine declaration, the __global attribute is added to the pointer arguments, and there’s a different way of getting the thread index. Also, the CUDA kernel code is in the same .cu file as the source for the host, while the OpenCL code is in a separate .cl file. We could have separated out the CUDA code into its own .cu file and put the host code in a standard C++ source file. This would be similar to the structure we use for our OpenCL application.
Note Many of the differences between the kernel codes for CUDA and OpenCL are superficial.
So how different is the OpenCL host-side code from the CUDA version? Let’s take a look at the OpenCL version in listing 12.16 and compare it to the code in listing 12.5. There are two versions of the OpenCL stream triad: StreamTriad_simple.c without error checking and StreamTriad.c with error checking. The error checking adds many lines of code that initially just get in the way of understanding what is going on.
Listing 12.16 OpenCL version of stream triad: Set up and tear down
OpenCL/StreamTriad/StreamTriad_simple.c
 5 #include "StreamTriad_kernel.inc"
 6 #ifdef __APPLE_CC__                                         ❶
 7 #include <OpenCL/OpenCL.h>                                  ❶
 8 #else                                                       ❶
 9 #include <CL/cl.h>                                          ❶
10 #endif                                                      ❶
11 #include "ezcl_lite.h"                                      ❷
< . . . skipping code . . . >
32    cl_command_queue command_queue;
33    cl_context context;
34    iret = ezcl_devtype_init(CL_DEVICE_TYPE_GPU,             ❸
             &command_queue, &context);                        ❸
35    const char *defines = NULL;
36    cl_program program = ezcl_create_program_wsource(        ❹
             context, defines, StreamTriad_kernel_source);     ❹
37    cl_kernel kernel_StreamTriad =                           ❺
             clCreateKernel(program, "StreamTriad", &iret);    ❺
38
39    // allocate device memory. suffix of _d indicates a device pointer
40    size_t nsize = stream_array_size*sizeof(double);
41    cl_mem a_d = clCreateBuffer(context, CL_MEM_READ_WRITE, nsize, NULL, &iret); ❻
42    cl_mem b_d = clCreateBuffer(context, CL_MEM_READ_WRITE, nsize, NULL, &iret); ❻
43    cl_mem c_d = clCreateBuffer(context, CL_MEM_READ_WRITE, nsize, NULL, &iret); ❻
44
45    // setting work group size and padding
      // to get even number of workgroups
46    size_t local_work_size = 512;                            ❼
47    size_t global_work_size = ( (stream_array_size + local_work_size - 1) ❼
             /local_work_size ) * local_work_size;             ❼
< . . . skipping code . . . >
74    clReleaseMemObject(a_d);                                 ❻
75    clReleaseMemObject(b_d);                                 ❻
76    clReleaseMemObject(c_d);                                 ❻
77
78    clReleaseKernel(kernel_StreamTriad);                     ❽
79    clReleaseCommandQueue(command_queue);                    ❽
80    clReleaseContext(context);                               ❽
81    clReleaseProgram(program);                               ❽
❶ The OpenCL include file is in a different location on Apple platforms.
❷ Our EZCL_Lite support library
❸ Detects the GPU device and sets up the command queue and context
❹ Creates the program from the source
❺ Compiles the StreamTriad kernel in the source
❻ Creates and, at the end, releases the device memory buffers
❼ Work group size calculation is similar to CUDA.
❽ Cleans up kernel and device-related objects
At the start of the program, we encounter some real differences at lines 34-37, where we have to find our GPU device and compile our device code. This is done for us behind the scenes in CUDA. Two of the lines of OpenCL code call our EZCL_Lite routines to detect the device and to create the program object. We made these calls because the amount of code required for these functions is too long to show here. The source for these routines is hundreds of lines long, though much of it is error checking.
Note The source is available with the chapter examples in the OpenCL/StreamTriad directory at https://github.com/EssentialsofParallelComputing/Chapter12. Some of the error checking code has been left out of the short version, StreamTriad_simple.c, but it is in the long version of the code in the file StreamTriad.c.
The rest of the set up and tear down code follows the same pattern that we saw in the CUDA code, with a little more cleanup required, again related to the device and program source handling. Now, how does the section of code that calls the OpenCL kernel in the timing loop in listing 12.16 compare to the CUDA code from listing 12.6?
Listing 12.17 OpenCL version of stream triad: Kernel call and timing loop
OpenCL/StreamTriad/StreamTriad_simple.c
49 for (int k=0; k<NTIMES; k++){
50 cpu_timer_start(&ttotal);
51 // copying array data from host to device
52 iret=clEnqueueWriteBuffer(command_queue, ❶
a_d, CL_FALSE, 0, nsize, &a[0], ❶
0, NULL, NULL); ❶
53 iret=clEnqueueWriteBuffer(command_queue, ❶
b_d, CL_TRUE, 0, nsize, &b[0], ❶
0, NULL, NULL); ❶
54
55 cpu_timer_start(&tkernel);
56 // set stream triad kernel arguments
57 iret=clSetKernelArg(kernel_StreamTriad, ❷
0, sizeof(cl_int), ❷
(void *)&stream_array_size); ❷
58 iret=clSetKernelArg(kernel_StreamTriad, ❷
1, sizeof(cl_double), ❷
(void *)&scalar); ❷
59 iret=clSetKernelArg(kernel_StreamTriad, ❷
2, sizeof(cl_mem), (void *)&a_d); ❷
60 iret=clSetKernelArg(kernel_StreamTriad, ❷
3, sizeof(cl_mem), (void *)&b_d); ❷
61 iret=clSetKernelArg(kernel_StreamTriad, ❷
4, sizeof(cl_mem), (void *)&c_d); ❷
62 // call stream triad kernel
63 clEnqueueNDRangeKernel(command_queue, ❸
kernel_StreamTriad, 1, NULL, ❸
&global_work_size, &local_work_size, ❸
0, NULL, NULL); ❸
64 // need to force completion to get timing
65 clEnqueueBarrier(command_queue);
66 tkernel_sum += cpu_timer_stop(tkernel);
67
68 iret=clEnqueueReadBuffer(command_queue, ❹
c_d, CL_TRUE, 0, nsize, c, ❹
0, NULL, NULL); ❹
69 ttotal_sum += cpu_timer_stop(ttotal);
70 }
What is happening on lines 57-61? OpenCL requires a separate call for every kernel argument. If we also check the return code from each call, it takes even more lines. This is far more verbose than the single line 53 in listing 12.6 in the CUDA version. But there is a direct correspondence between the two versions; OpenCL is just more verbose in describing the operations that pass the arguments. Except for the device detection and program compilation, the programs are similar in their operations. The biggest difference is the syntax used in the two languages.
In listing 12.18, we show a rough call sequence for the device detection and the create program calls. What makes these routines long is the error checking and the handling required for special cases. For these two functions, it is important to have good error handling. We need the compiler report when there is an error in our source code or when we get the wrong GPU device.
Listing 12.18 OpenCL support library ezcl_lite
OpenCL/StreamTriad/ezcl_lite.c
/* init and finish routine */
cl_int ezcl_devtype_init(cl_device_type device_type,
cl_command_queue *command_queue, cl_context *context);
clGetPlatformIDs -- first to get number of platforms and allocate
clGetPlatformIDs -- now get platforms
Loop on number of platforms and
clGetDeviceIDs -- once to get number of devices and allocate
clGetDeviceIDs -- get devices
check for double precision support -- clGetDeviceInfo
End loop
clCreateContext
clCreateCommandQueue
/* kernel and program routines */
cl_program ezcl_create_program_wsource(cl_context context,
const char *defines, const char *source);
clCreateProgramWithSource
set a compile string (hardware specific options)
clBuildProgram
Check for error, if found
clGetProgramBuildInfo
and printout compile report
End error handling
We conclude this presentation on OpenCL with a nod to the many language interfaces that have been created for it. There are C++, Python, Perl, and Java versions. In each of these languages, a higher-level interface has been created that hides some of the details in the C version of OpenCL. And we highly recommend the use of our EZCL library or one of the many other middleware libraries for OpenCL.
There has been an unofficial C++ version available since OpenCL v1.2. The implementation is just a thin layer on top of the C version of OpenCL. Despite not being approved by the standards committee, it is completely usable by developers. It is available at https://github.com/KhronosGroup/OpenCL-CLHPP. The formal approval of C++ in OpenCL occurred only recently, and we are still waiting on implementations.
The sum reduction in OpenCL is similar to that in CUDA. Rather than step through the code, we’ll just look at the differences in the kernel source. Shown first in figure 12.3 is the side-by-side difference of sum_within_block, the routine common to both kernels.
The differences in this device kernel, which is called by another kernel, begin with the attributes on the declaration. CUDA requires a __device__ attribute on the declaration, while OpenCL does not. For the arguments, passing in the scratchpad array requires a __local attribute that CUDA does not need. The next difference is the syntax for getting the local thread index and block (tile) size (lines 5 and 6 of figure 12.3). The synchronization calls are also different. At the top of the routine, a warp size is defined by a macro to help with portability between NVIDIA and AMD GPUs. CUDA defines this as a warp-size variable. For OpenCL, it is passed in with a compiler define. We also change the terminology from block to tile in the actual code to stay consistent with each language's terminology.
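Stripped of these language-specific qualifiers, the body of sum_within_block is a pairwise tree reduction over the tile's scratchpad array. The following serial C++ sketch is our hypothetical host-side emulation of that logic, not the actual kernel: on the GPU, each iteration of the inner loop runs on a separate thread, and the end of each outer iteration is the synchronization call, __syncthreads() in CUDA or barrier(CLK_LOCAL_MEM_FENCE) in OpenCL.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical serial emulation of the tree reduction in sum_within_block.
// Assumes the tile size is a power of two, as GPU tile sizes typically are.
double sum_within_tile(std::vector<double> scratch) {
    for (std::size_t offset = scratch.size() / 2; offset > 0; offset /= 2) {
        for (std::size_t i = 0; i < offset; ++i) {  // one GPU thread per i
            scratch[i] += scratch[i + offset];
        }
        // On the device, a barrier goes here before the next halving.
    }
    return scratch[0];  // thread 0 ends up holding the tile's sum
}
```

Each pass halves the number of active entries, so a 128-wide tile is summed in seven passes rather than 127 serial additions.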
The next routine is the first of the two kernel passes, called stage1of2, in figure 12.4. This kernel is called from the host. The __global__ attribute for CUDA becomes __kernel for OpenCL. We also have to add the __global attribute to the pointer arguments for OpenCL.
The next difference is an important one to take note of. In CUDA, we declare the scratchpad in shared memory as an extern __shared__ variable in the body of the kernel. On the host side, the size of this shared memory space is given as a number of bytes in the optional third argument in the triple chevron brackets. OpenCL does this differently. It is passed as the last argument in the argument list with the __local attribute. On the host side, the memory is specified in the set argument call for the fourth kernel argument:
clSetKernelArg(reduce_sum_1of2, 4,
               local_work_size*sizeof(cl_double), NULL);
The size is the third argument in the call. The rest of the changes are in the syntax to set the thread parameters and the synchronization call. The last part of the comparison is the second pass of the sum reduction kernel in figure 12.5.
We’ve already seen all of the change patterns in the second kernel. We still have the differences in the declaration of the kernel and the arguments. The local scratch array also has the same differences as the kernel for the first pass. The thread parameters and the synchronization also have the same expected differences.
Looking back at the three comparisons in figures 12.3-12.5, what stands out is what we didn't have to note: the bodies of the kernels are essentially the same. The only difference is the syntax for the synchronization call. The host-side code for the sum reduction in OpenCL is shown in the following listing.
Listing 12.19 Host code for the OpenCL sum reduction
OpenCL/SumReduction/SumReduction.c
20 cl_context context;
21 cl_command_queue command_queue;
22 ezcl_devtype_init(CL_DEVICE_TYPE_GPU, &command_queue, &context);
23
24 const char *defines = NULL;
25 cl_program program = ezcl_create_program_wsource(context, defines,
SumReduction_kernel_source);
26 cl_kernel reduce_sum_1of2=clCreateKernel( ❶
program, "reduce_sum_stage1of2_cl", &iret); ❶
27 cl_kernel reduce_sum_2of2=clCreateKernel( ❶
program, "reduce_sum_stage2of2_cl", &iret); ❶
28
29 struct timespec tstart_cpu;
30 cpu_timer_start(&tstart_cpu);
31
32 size_t local_work_size = 128;
33 size_t global_work_size = ((nsize + local_work_size - 1)
/local_work_size) * local_work_size;
34 size_t nblocks = global_work_size/local_work_size;
35
36 cl_mem dev_x = clCreateBuffer(context, CL_MEM_READ_WRITE,
nsize*sizeof(double), NULL, &iret);
37 cl_mem dev_total_sum = clCreateBuffer(context, CL_MEM_READ_WRITE,
1*sizeof(double), NULL, &iret);
38 cl_mem dev_redscratch = clCreateBuffer(context, CL_MEM_READ_WRITE,
nblocks*sizeof(double), NULL, &iret);
39
40 clEnqueueWriteBuffer(command_queue, dev_x, CL_TRUE, 0,
nsize*sizeof(cl_double), &x[0], 0, NULL, NULL);
41
42 clSetKernelArg(reduce_sum_1of2, 0, ❷
sizeof(cl_int), (void *)&nsize); ❷
43 clSetKernelArg(reduce_sum_1of2, 1, ❷
sizeof(cl_mem), (void *)&dev_x); ❷
44 clSetKernelArg(reduce_sum_1of2, 2, ❷
sizeof(cl_mem), (void *)&dev_total_sum); ❷
45 clSetKernelArg(reduce_sum_1of2, 3, ❷
sizeof(cl_mem), (void *)&dev_redscratch); ❷
46 clSetKernelArg(reduce_sum_1of2, 4, ❷
local_work_size*sizeof(cl_double), NULL); ❷
47
48 clEnqueueNDRangeKernel(command_queue, ❷
reduce_sum_1of2, 1, NULL, &global_work_size, ❷
&local_work_size, 0, NULL, NULL); ❷
49
50 if (nblocks > 1) { ❸
51 clSetKernelArg(reduce_sum_2of2, 0, ❹
sizeof(cl_int), (void *)&nblocks); ❹
52 clSetKernelArg(reduce_sum_2of2, 1, ❹
sizeof(cl_mem), (void *)&dev_total_sum); ❹
53 clSetKernelArg(reduce_sum_2of2, 2, ❹
sizeof(cl_mem), (void *)&dev_redscratch); ❹
54 clSetKernelArg(reduce_sum_2of2, 3, ❹
local_work_size*sizeof(cl_double), NULL); ❹
55
56 clEnqueueNDRangeKernel(command_queue, ❹
reduce_sum_2of2, 1, NULL, &local_work_size, ❹
&local_work_size, 0, NULL, NULL); ❹
57 }
58
59 double total_sum;
60
61 iret=clEnqueueReadBuffer(command_queue, dev_total_sum, CL_TRUE, 0,
1*sizeof(cl_double), &total_sum, 0, NULL, NULL);
62
63 printf("Result -- total sum %lf \n",total_sum);
64
65 clReleaseMemObject(dev_x);
66 clReleaseMemObject(dev_redscratch);
67 clReleaseMemObject(dev_total_sum);
68
69 clReleaseKernel(reduce_sum_1of2);
70 clReleaseKernel(reduce_sum_2of2);
71 clReleaseCommandQueue(command_queue);
72 clReleaseContext(context);
73 clReleaseProgram(program);
❶ Two kernels to create from a single source
❷ Sets arguments for and calls the first reduction pass
❸ Second pass is needed only if there is more than one block ...
❹ ... and sets arguments for and calls second reduction pass
The call to the first kernel pass creates a local scratchpad array on line 46. The intermediate results are stored back into the redscratch array created on line 38. If there is more than one block, a second pass is needed, and the redscratch array is passed back in to complete the reduction. Note that the kernel launch sizes in arguments 5 and 6 are both set to local_work_size, that is, a single work group. This is so a synchronization can be done across all the remaining data and another pass is not needed.
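The control flow of this host code can be checked with ordinary CPU code. The following is our hypothetical serial emulation, not OpenCL: it repeats the work-group arithmetic from lines 33-34, a first pass that leaves one partial sum per work group in redscratch, and a second pass that reduces those partial sums.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical serial emulation of the two-pass OpenCL sum reduction.
double two_pass_sum(const std::vector<double>& x, std::size_t local_work_size) {
    std::size_t nsize = x.size();
    // Round the global work size up to a multiple of the work-group size.
    std::size_t global_work_size =
        ((nsize + local_work_size - 1) / local_work_size) * local_work_size;
    std::size_t nblocks = global_work_size / local_work_size;

    // Pass 1: each work group reduces its elements into one redscratch slot.
    std::vector<double> redscratch(nblocks, 0.0);
    for (std::size_t i = 0; i < nsize; ++i) {
        redscratch[i / local_work_size] += x[i];
    }

    // Pass 2: one work group reduces the nblocks partial sums. (On the
    // device this pass is skipped when nblocks == 1, since redscratch[0]
    // already holds the total.)
    double total_sum = 0.0;
    for (std::size_t ib = 0; ib < nblocks; ++ib) {
        total_sum += redscratch[ib];
    }
    return total_sum;
}
```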
SYCL started out in 2014 as an experimental C++ implementation on top of OpenCL. The goal of the developers creating SYCL was a more natural extension of the C++ language than the add-on feel of OpenCL with the C language. It is being developed as a cross-platform abstraction layer that leverages the portability and efficiency of OpenCL. Its experimental focus changed suddenly when Intel chose it as one of the major language pathways for the announced Department of Energy Aurora HPC system. The Aurora system will use the new Intel discrete GPUs that are under development. Intel has proposed some additions to the SYCL standard, which it has prototyped in the Data Parallel C++ (DPCPP) compiler of its oneAPI open programming system.
You can get introduced to SYCL in several ways. Some of these even avoid having to install the software or having the right hardware. You might first try out the following cloud-based systems:
Interactive SYCL provides a tutorial on the tech.io website at https://tech.io/playgrounds/48226/introduction-to-sycl/introduction-to-sycl-2.
Intel provides a cloud version of oneAPI and DPCPP at https://software.intel.com/en-us/oneapi. You must register to use it.
You can also download and install versions of SYCL from these sites:
The ComputeCPP community edition at https://developer.codeplay.com/products/computecpp/ce/home/. You must register to download.
The Intel DPCPP compiler at https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md
Intel also provides Docker file setup instructions at https://github.com/intel/oneapi-containers/blob/master/images/docker/basekit-devel-ubuntu18.04/Dockerfile
We’ll work with Intel’s DPCPP version of SYCL. There are instructions to set up a VirtualBox installation of oneAPI with the examples that accompany this chapter in the README.virtualbox at https://github.com/EssentialsofParallelComputing/Chapter12. You should be able to run VirtualBox on nearly any operating system. Let’s start off with a simple makefile for the DPCPP compiler as the following listing shows.
Listing 12.20 Simple makefile for DPCPP version of SYCL
DPCPP/StreamTriad/Makefile
 1 CXX = dpcpp                        ❶
 2 CXXFLAGS = -std=c++17 -fsycl -O3   ❷
 3
 4 all: StreamTriad
 5
 6 StreamTriad: StreamTriad.o timer.o
 7 	$(CXX) $(CXXFLAGS) $^ -o $@
 8
 9 clean:
10 	-rm -f StreamTriad.o StreamTriad
❶ Specifies dpcpp as the C++ compiler
❷ Adds the SYCL option to the C++ flags
Setting the C++ compiler to the Intel dpcpp compiler takes care of the paths, libraries, and include files. The only other requirement is to set some flags for the C++ compiler. The following listing shows the SYCL source for our example.
Listing 12.21 Stream triad example for DPCPP version of SYCL
DPCPP/StreamTriad/StreamTriad.cc
 1 #include <chrono>
 2 #include "CL/sycl.hpp"                              ❶
 3
 4 namespace Sycl = cl::sycl;                          ❷
 5 using namespace std;
 6
 7 int main(int argc, char * argv[])
 8 {
 9    chrono::high_resolution_clock::time_point t1, t2;
10
11    size_t nsize = 10000;
12    cout << "StreamTriad with " << nsize << " elements" << endl;
13
14    // host data
15    vector<double> a(nsize,1.0);                     ❸
16    vector<double> b(nsize,2.0);                     ❸
17    vector<double> c(nsize,-1.0);                    ❸
18
19    t1 = chrono::high_resolution_clock::now();
20
21    Sycl::queue Queue(Sycl::cpu_selector{});         ❹
22
23    const double scalar = 3.0;
24
25    Sycl::buffer<double,1> dev_a { a.data(), Sycl::range<1>(a.size()) };  ❺
26    Sycl::buffer<double,1> dev_b { b.data(), Sycl::range<1>(b.size()) };  ❺
27    Sycl::buffer<double,1> dev_c { c.data(), Sycl::range<1>(c.size()) };  ❺
28
29    Queue.submit([&](Sycl::handler& CommandGroup) {  ❻
30
31       auto a = dev_a.get_access<Sycl::access::mode::read>(CommandGroup);   ❼
32       auto b = dev_b.get_access<Sycl::access::mode::read>(CommandGroup);   ❼
33       auto c = dev_c.get_access<Sycl::access::mode::write>(CommandGroup);  ❼
34
35       CommandGroup.parallel_for<class StreamTriad>(Sycl::range<1>{nsize},
            [=] (Sycl::id<1> it){                     ❽
36          c[it] = a[it] + scalar * b[it];
37       });
38    });
39    Queue.wait();                                    ❾
40
41    t2 = chrono::high_resolution_clock::now();
42    double time1 = chrono::duration_cast<chrono::duration<double> >(t2 - t1).count();
43    cout << "Runtime is " << time1*1000.0 << " msecs " << endl;
44 }
❶ Includes the SYCL header file
❷ Shortens the namespace name to Sycl
❸ Initializes the host side vectors to constants
❹ Selects a device and creates a queue for it
❺ Allocates the device buffer and sets to the host buffer
❻ Submits a command group to the queue
❼ Gets access to device arrays
❽ Lambda for parallel for kernel
❾ Waits for the queue to complete
The first Sycl function selects a device and creates a queue to work on it. We ask for a CPU, though this code would also work for GPUs with unified memory.
Sycl::queue Queue(Sycl::cpu_selector{});
We select a CPU for maximum portability so that the code runs on most systems. To make this code work on GPUs without unified memory, we would need to add explicit copies of data from one memory space to another. The default selector preferentially finds a GPU, but falls back to a CPU. If we want to only select a GPU or CPU, we could also specify other selectors such as
Sycl::queue Queue(Sycl::default_selector{}); // uses the default device
Sycl::queue Queue(Sycl::gpu_selector{});     // finds a GPU device
Sycl::queue Queue(Sycl::cpu_selector{});     // finds a CPU device
Sycl::queue Queue(Sycl::host_selector{});    // runs on the host (CPU)
The last option means that it will run on the host as if there were no SYCL or OpenCL code. The setup of the device and queue is far simpler than what we did in OpenCL. Now we need to set up device buffers with the SYCL buffer:
Sycl::buffer<double,1> dev_a { a.data(), Sycl::range<1>(a.size()) };
The first argument to the buffer is a data type, and the second is the dimensionality of the data. Then we give it the variable name, dev_a. The first argument to the variable is the host data array to use for initializing the device array, and the second is the index set to use. In this case, we specify a 1D range from 0 to the size of the a variable. On line 29, we encounter the first lambda to create a command group handler for the queue:
Queue.submit([&](Sycl::handler& CommandGroup)
We introduced lambdas in section 10.2.1. The lambda capture clause, [&], specifies capturing outside variables used in the routine by reference. For this lambda, the capture gets nsize, scalar, dev_a, dev_b, and dev_c for use in the lambda. We could specify it with just the single capture setting of by reference, [&], or with the following form, where we specify each variable that will be captured. Good programming practice would prefer the latter, but the lists can get long.
Queue.submit([&nsize, &scalar, &dev_a, &dev_b, &dev_c]
             (Sycl::handler& CommandGroup)
In the body of the lambda, we get access to the device arrays and rename them for use within the device routine. This is equivalent to a list of arguments for the command group handler. We then create the first task for the command group, a parallel_for. The parallel_for also is defined with a lambda.
CommandGroup.parallel_for<class StreamTriad>(Sycl::range<1>{nsize},
   [=] (Sycl::id<1> it)
The name of the lambda is StreamTriad. We then tell it that we will operate over a 1D range that goes from 0 to nsize. The capture clause, [=], captures the a, b, and c variables by value. Determining whether to capture by reference or by value can be tricky, but if the code gets pushed to the GPU, the original reference may be out of scope and no longer valid. Last, we create a 1D index variable, it, to iterate over the range.
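The practical difference between the two capture modes comes down to when the variable's value is read. This small host-only C++ sketch (ours, no SYCL involved) shows that a [=] capture copies the value when the lambda is created, while a [&] capture reads it when the lambda is called, which is why a reference capture can dangle by the time a device kernel actually runs:

```cpp
#include <utility>

// Returns the results of a value-captured and a reference-captured lambda
// after the captured variable has changed.
std::pair<double, double> capture_demo() {
    double scalar = 3.0;
    auto by_value     = [=] { return scalar; };  // copies scalar now
    auto by_reference = [&] { return scalar; };  // reads scalar at call time
    scalar = 7.0;
    return { by_value(), by_reference() };
}
```

The value-captured lambda still returns 3.0 after the change; the reference-captured one sees 7.0.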
By now, you are seeing that the differences between CPU and GPU kernels are not all that big. So why not generate each of them using C++ polymorphism and templates? Well, that is exactly what a couple of libraries developed by Department of Energy research laboratories have done. These projects were started to tackle the porting of many of their codes to new hardware architectures. The Kokkos system was created by Sandia National Laboratories and has gained a wide following. Lawrence Livermore National Laboratory has a similar project by the name of RAJA. Both of these projects have already succeeded in their goal of a single-source, multiplatform capability.
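The mechanism these libraries use can be sketched in a few lines of plain C++. This toy is not the Kokkos or RAJA API: a forall template takes an execution-policy tag that selects the loop implementation at compile time, while the caller's lambda stays unchanged. A real backend would emit OpenMP pragmas or a CUDA launch where we use serial loops.

```cpp
#include <cstddef>

// Toy execution policies standing in for backend choices.
struct seq_exec {};
struct unrolled_exec {};

// The caller writes the loop body once; the policy picks the implementation.
template <typename Body>
void forall(seq_exec, std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
}

template <typename Body>
void forall(unrolled_exec, std::size_t n, Body body) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) { body(i); body(i+1); body(i+2); body(i+3); }
    for (; i < n; ++i) body(i);   // remainder loop
}

// Single-source stream triad: the same lambda runs under any policy.
template <typename Policy>
void triad(Policy p, std::size_t n,
           const double* a, const double* b, double scalar, double* c) {
    forall(p, n, [=](std::size_t i) { c[i] = a[i] + scalar * b[i]; });
}
```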
These two languages have similarities in a lot of respects to the SYCL language that you saw in section 12.4. Indeed, they have borrowed concepts from each other as they strive for performance portability. Each of them provides libraries that are fairly light layers on top of lower-level parallel programming languages. We’ll take a short look at each of them.
Kokkos is a well-designed abstraction layer for languages such as OpenMP and CUDA. It has been in development since 2011. Kokkos has the following named execution spaces. These are enabled in the Kokkos build with the corresponding flag to CMake (or the option to build with Spack). Some of these are better developed than others.
Serial -- sequential execution on the host
Threads -- C++ threads on the host
OpenMP -- OpenMP threading on the host
Cuda -- NVIDIA GPUs through CUDA
HIP (experimental) -- AMD GPUs through HIP
Listing 12.22 Kokkos CMake file
Kokkos/StreamTriad/CMakeLists.txt
 1 cmake_minimum_required (VERSION 3.10)
 2 project (StreamTriad)
 3
 4 find_package(Kokkos REQUIRED)                       ❶
 5
 6 add_executable(StreamTriad StreamTriad.cc)
 7 target_link_libraries(StreamTriad Kokkos::kokkos)   ❷
❶ Finds Kokkos and sets the flags
❷ Adds dependencies and flags to build
Adding the CUDA option to the Kokkos build generates a version that runs on NVIDIA GPUs. There are many other platforms and languages that Kokkos can handle and more are being developed all the time.
The Kokkos stream triad example in listing 12.23 has some similarities to SYCL in that it uses C++ lambdas to encapsulate functions for either the CPU or GPU. Kokkos also supports functors for this mechanism, but lambdas are less verbose to use in practice.
Listing 12.23 Kokkos stream triad example
Kokkos/StreamTriad/StreamTriad.cc
 1 #include <Kokkos_Core.hpp>                     ❶
 2
 3 using namespace std;
 4
 5 int main (int argc, char *argv[])
 6 {
 7    Kokkos::initialize(argc, argv);{            ❷
 8
 9    Kokkos::Timer timer;
10    double time1;
11
12    double scalar = 3.0;
13    size_t nsize = 1000000;
14    Kokkos::View<double *> a( "a", nsize);      ❸
15    Kokkos::View<double *> b( "b", nsize);      ❸
16    Kokkos::View<double *> c( "c", nsize);      ❸
17
18    cout << "StreamTriad with " << nsize << " elements" << endl;
19
20    Kokkos::parallel_for(nsize, KOKKOS_LAMBDA (int i) {        ❹
21       a[i] = 1.0;
22    });
23    Kokkos::parallel_for(nsize, KOKKOS_LAMBDA (int i) {        ❹
24       b[i] = 2.0;
25    });
26
27    timer.reset();
28
29    Kokkos::parallel_for(nsize, KOKKOS_LAMBDA (const int i) {  ❹
30       c[i] = a[i] + scalar * b[i];
31    });
32
33    time1 = timer.seconds();
34
35    int icount = 0;
36    for (int i=0; i<nsize && icount < 10; i++){
37       if (c[i] != 1.0 + 3.0*2.0) {
38          cout << "Error with result c[" << i << "]=" << c[i] << endl;
39          icount++;
40       }
41    }
42
43    if (icount == 0) cout << "Program completed without error." << endl;
44    cout << "Runtime is " << time1*1000.0 << " msecs " << endl;
45
46    }
47    Kokkos::finalize();                         ❺
48    return 0;
49 }
❶ Includes the appropriate Kokkos header
❷ Starts up Kokkos; the opening brace scopes the Kokkos views
❸ Declares arrays with Kokkos::View
❹ Kokkos parallel_for lambdas for CPU or GPU
❺ Shuts down Kokkos
The Kokkos program starts with Kokkos::initialize and Kokkos::finalize. These commands start up those things that are needed for the execution space, such as threads. Kokkos is unique in that it encapsulates flexible multi-dimensional array allocations as data views that can be switched depending on the target architecture. In other words, you can use a different data order for CPU versus GPU. We use Kokkos::View on lines 14-16, though this is only for 1D arrays. The real value comes with multidimensional arrays. The general syntax for Kokkos::View is
View < double *** , Layout , MemorySpace > name (...);
Memory spaces are an option for the template, but have a default appropriate for the execution space. Some memory spaces are
HostSpace -- CPU main memory (the default for host execution spaces)
CudaSpace -- GPU device memory on NVIDIA hardware (the default for the Cuda execution space)
CudaUVMSpace -- CUDA unified (managed) memory accessible from both host and device
The layout can be specified, although it has a default appropriate for the memory space:
For LayoutLeft, the leftmost index is stride 1 (default for CudaSpace)
For LayoutRight, the rightmost index is stride 1 (default for HostSpace)
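The two layouts differ only in their index arithmetic. Here is a hypothetical sketch (ours, not the Kokkos implementation) of the linear offset of element (i, j) in a 2D view with extents n0 by n1 under each layout:

```cpp
#include <cstddef>

// LayoutRight: rightmost index is stride 1 (C ordering). Consecutive j
// values sit next to each other in memory, which suits CPU cache lines.
std::size_t offset_layout_right(std::size_t i, std::size_t j,
                                std::size_t /*n0*/, std::size_t n1) {
    return i * n1 + j;
}

// LayoutLeft: leftmost index is stride 1 (Fortran ordering). Consecutive i
// values sit next to each other, so consecutive GPU threads indexed by i
// get coalesced memory accesses.
std::size_t offset_layout_left(std::size_t i, std::size_t j,
                               std::size_t n0, std::size_t /*n1*/) {
    return i + j * n0;
}
```

Because a View hides this arithmetic behind a(i,j), the same application code gets the layout best suited to the execution space it is compiled for.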
The kernels are specified using a lambda syntax on one of three data parallel patterns:
parallel_for -- applies the body over an index range
parallel_reduce -- combines per-index contributions into a single result
parallel_scan -- computes prefix sums over an index range
On lines 20, 23, and 29 in listing 12.23, we used the parallel_for pattern. The KOKKOS_LAMBDA macro replaces the [=] or [&] capture syntax. Kokkos takes care of specifying this for you and does it in a much more readable form.
The RAJA performance portability layer has the goal of achieving portability with a minimum of disruption to existing Lawrence Livermore National Laboratory codes. In many ways, it is simpler and easier to adopt than other comparable systems. RAJA can be built with support for the following:
Sequential and SIMD execution
OpenMP threading
Intel Threading Building Blocks (TBB)
CUDA for NVIDIA GPUs
HIP for AMD GPUs
RAJA also has good support for CMake as the following listing shows.
Listing 12.24 Raja CMake file

Raja/StreamTriad/CMakeLists.txt
 1 cmake_minimum_required (VERSION 3.0)
 2 project (StreamTriad)
 3
 4 find_package(Raja REQUIRED)
 5 find_package(OpenMP REQUIRED)
 6
 7 add_executable(StreamTriad StreamTriad.cc)
 8 target_link_libraries(StreamTriad PUBLIC RAJA)
 9 set_target_properties(StreamTriad PROPERTIES
      COMPILE_FLAGS ${OpenMP_CXX_FLAGS})
10 set_target_properties(StreamTriad PROPERTIES
      LINK_FLAGS "${OpenMP_CXX_FLAGS}")
The RAJA version of the stream triad takes only a few changes as the following listing shows. RAJA also heavily leverages lambdas to provide their portability to CPUs and GPUs.
Listing 12.25 Raja stream triad example
Raja/StreamTriad/StreamTriad.cc
 1 #include <chrono>
 2 #include "RAJA/RAJA.hpp"                       ❶
 3
 4 using namespace std;
 5
 6 int main(int RAJA_UNUSED_ARG(argc), char **RAJA_UNUSED_ARG(argv[]))
 7 {
 8    chrono::high_resolution_clock::time_point t1, t2;
 9    cout << "Running Raja Stream Triad\n";
10
11    const int nsize = 1000000;
12
13    // Allocate and initialize vector data.
14    double scalar = 3.0;
15    double* a = new double[nsize];
16    double* b = new double[nsize];
17    double* c = new double[nsize];
18
19    for (int i = 0; i < nsize; i++) {
20       a[i] = 1.0;
21       b[i] = 2.0;
22    }
23
24    t1 = chrono::high_resolution_clock::now();
25
26    RAJA::forall<RAJA::omp_parallel_for_exec>(
         RAJA::RangeSegment(0,nsize), [=](int i) {  ❷
27       c[i] = a[i] + scalar * b[i];
28    });
29
30    t2 = chrono::high_resolution_clock::now();
31    < ... error checking ... >
42    double time1 = chrono::duration_cast<chrono::duration<double> >(t2 - t1).count();
43    cout << "Runtime is " << time1*1000.0 << " msecs " << endl;
44 }
❷ Raja forall using C++ lambda
The required changes for RAJA are to include the RAJA header file on line 2 and to change the computation loop to a RAJA::forall. You can see that the RAJA developers provide a low barrier to entry for gaining performance portability. To run the RAJA test, we include a script that builds and installs RAJA, as the following listing shows. The script then goes on to build the stream triad code with RAJA and run it.
Listing 12.26 Integrated build and run script for Raja stream triad
Raja/StreamTriad/Setup_Raja.sh
#!/bin/sh
export INSTALL_DIR=`pwd`/build/Raja
export Raja_DIR=${INSTALL_DIR}/share/raja/cmake             ❶

mkdir -p build/Raja_tmp && cd build/Raja_tmp
cmake ../../Raja_build -DCMAKE_INSTALL_PREFIX=${INSTALL_DIR}
make -j 8 install && cd .. && rm -rf Raja_tmp

cmake .. && make && ./StreamTriad                           ❷
❶ Raja_DIR points to Raja CMake tool.
❷ Builds the stream triad code and runs it
We covered a lot of different programming languages in this chapter. Think of these as dialects of a common language rather than completely different ones.
We have only begun to scratch the surface with all of these native GPU languages and performance portability systems. Even with the initial functionality shown, you can begin to implement some real application codes. If you’re serious about using any of these in your applications, we strongly recommend availing yourself of the many additional resources for the language of your choice.
As the dominant GPU language for many years, there are many materials on CUDA programming. Perhaps the first place to go is the NVIDIA Developer’s website at https://developer.nvidia.com/cuda-zone. There you’ll find extensive guides on installing and using CUDA.
David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach (Morgan Kaufmann, 2016).
AMD (https://rocm.github.io) has created a website that covers all aspects of their ROCm ecosystem.
If you want to really learn more about OpenCL, we highly recommend the book by Matthew Scarpino:
Matthew Scarpino, OpenCL in Action: How to Accelerate Graphics and Computations (Manning, 2011).
A good source of additional information on OpenCL is https://www.iwocl.org, sponsored by the International Workshop on OpenCL (IWOCL). They also host an international conference annually. SYCLcon is also hosted through the same site.
Khronos is the open standards body for OpenCL, SYCL, and related software. They host the language specifications, forums, and resource lists:
Khronos Group, https://www.khronos.org/opencl/ and https://www.khronos.org/sycl/.
For documentation and training materials on Kokkos, see their GitHub repository. Besides downloading the Kokkos software, you’ll also find a companion repository (https://github.com/kokkos/kokkos-tutorials) for the tutorials they give around the country.
The RAJA team (https://raja.readthedocs.io) has extensive documentation at their website.
Change the host memory allocation in the CUDA stream triad example to use pinned memory (listings 12.1-12.6). Did you get a performance improvement?
For the sum reduction example, try an array size of 18,000 elements all initialized to their index value. Run the CUDA code and then the version in SumReductionRevealed. You may want to adjust the amount of information printed.
For the SYCL example in listing 12.20, initialize the a and b arrays on the GPU device.
Convert the two initialization loops in the RAJA example in listing 12.24 to the RAJA::forall syntax. Try running the example with CUDA.
Use straightforward modifications from the original CPU code for most kernels. This makes the writing of kernels simpler and easier to maintain.
Careful design of cooperation and comparison in GPU kernels can yield good performance. The key to approaching these operations is breaking down the algorithm into steps and understanding the performance properties of the GPU.
Think about portability from the start. You will avoid having to create more code versions every time you want to run your application on another hardware platform.
Consider the single-source performance portability languages. If you need to run on a variety of hardware, these can be worth the initial difficulty in code development.
1. See the CUDA installation guide for details (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
In this chapter, we will cover the tools and the different workflows that you can use to accelerate your application development. We’ll show you how profiling tools for the GPU can be helpful. In addition, we’ll discuss how to deal with the challenges of using profiling tools when working on a remote HPC cluster. Because the profiling tools continue to change and improve, we’ll focus on the methodology rather than the details of any one tool. The main takeaway of this chapter will be understanding how to create a productive workflow when using the powerful GPU profiling tools.
Profiling tools allow for quicker optimization, improving hardware utilization, and a better understanding of the application performance and hotspots. We’ll discuss how profiling tools expose bottlenecks and assist you in attaining better hardware usage. The following bulleted list highlights the commonly used tools in GPU profiling. We specifically show the NVIDIA tools for use with their GPUs because these tools have been around the longest. If you have a different vendor’s GPU on your system, substitute their tools in the workflow. Don’t forget about the standard Unix profiling tools such as gprof that we’ll use later in section 13.4.2.
We encourage you to follow along with the examples for this chapter. The accompanying source code is at http://github.com/EssentialsOfParallelComputing/Chapter13, which shows examples of installing the software packages for tools from different hardware vendors. There are detailed lists of all the software that can be installed for each vendor. You will probably want to install the tools for the corresponding hardware.
Note While a tool for another vendor might partially run on your system, its full functionality will be crippled.
NVIDIA nvidia-smi—When trying to get a quick system profile from the command line, you can use nvidia-smi. As shown and explained in section 9.6.2, NVIDIA SMI (System Management Interface) allows for monitoring and collecting power and temperature during an application run. NVIDIA SMI gives you hardware information along with many other system metrics. The link to the SMI guide and its options is in the “Further Explorations” section later in this chapter.
NVIDIA nvprof—This NVIDIA Visual Profiler command-line tool collects and reports data on GPU performance. The data can also be imported into a visual profiling tool such as the NVIDIA Visual Profiler (NVVP), or exported to other formats for application performance analysis. It shows performance metrics such as host-to-device copies, kernel usage, memory utilization, and many others.
NVIDIA NVVP—This NVIDIA Visual Profiler tool provides a visual representation of the application kernel performance. NVVP provides a GUI and guided analysis. It queries the same data that nvprof does, but presents it to the user visually, offering a quick timeline feature not as readily available with nvprof.
NVIDIA® Nsight™—Nsight is an updated version of NVVP that provides a visual representation of CPU and GPU usage and application performance. Eventually, it may replace NVVP.
NVIDIA PGPROF—The PGPROF utility originated with the Portland Group compiler. When the Portland Group was acquired by NVIDIA for their Fortran compiler, they merged Portland’s profiler, PGPROF, with the NVIDIA tools.
CodeXL (originally AMD CodeXL)—This GPUOpen profiler, debugger, and programming development workbench was originally developed by AMD. See the link to the CodeXL website in the “Additional Reading” section later in this chapter.
Before beginning any complicated task, you must select the appropriate workflow. You might either be onsite with excellent connectivity, offsite with a slow home network, or somewhere in between. Each case requires a different workflow. In this section, we’ll discuss four potential and efficient workflows for these different scenarios.
Figure 13.1 provides a visual representation of the four different workflows. Accessibility and connection speed are the determining factors in deciding which method you end up using. You can either run the tools with a graphics interface directly on the system, remotely with a client-server mode, or just avoid the problem by using command-line tools.
Figure 13.1 There are several different methods of using the profiling tools that give you alternatives for your application development situation.
When using profiling tools from a remote server, there is often a heavy delay in visualization and graphics interface response. Client-server mode separates the graphics interface so that it runs locally on your system. It then communicates with the server at the remote site to run the commands. This helps keep the interactive response of the graphical tool interface. For example, profiling tools such as NVVP can have a high latency when used on a remote server. Waiting minutes after every mouse click is not a very productive situation. Fortunately, the NVIDIA tools and many of the other tools give you several options to work around this problem. We go into greater detail on the different workflows in the following discussion.
Method 1: Run directly on the system—When your network connection for your graphics application is fast, this is the preferred method because the storage requirements are pretty large. If you have a fast connection for graphics display, it is the most efficient way to work. But if your display network connection is slow, the response time for the graphics window is painful, and you will want to use one of the remote options. VNC, X2Go, and NoMachine can compress the graphics output and send it instead, sometimes making slower connections workable.
Method 2: Remote server—This method runs the application with a command-line tool on the GPU system, then the files are transferred automatically to your local system. Firewalls, batch operations of the HPC system, and other network complications can make this method difficult or impossible to set up.
Method 3: Profile file download—This method runs nvprof on an HPC site and downloads the files to your local computer. In this method, you manually transfer files to your local computer using secure copy (scp) or some other utility and then work on your local machine. When trying to profile multiple applications, it can be easier to take the raw data in a csv format and combine it into a single dataframe. Though this method may no longer be usable by the conventional profiling tools, you can do your own detailed analysis on the server or locally.
Method 4: Develop locally—One of the great things about today’s HPC hardware is that you often have similar hardware that you can use to develop an application locally. You might have a GPU from the same vendor but not as powerful as the GPU in the HPC system. You can optimize your application with the expectation that everything will be faster on the big system. You might also be able to develop your code on the CPU with some of the languages where debugging is easier.
The important thing to realize is that even if you are not on a fast connection to a computing site, you have some options when using development tools. Whichever method you use to do your porting and performance analysis, you should ensure that the versions of the software you use match. This is particularly important for CUDA and the NVIDIA nvprof and NVVP tools.
In this section, we’ll work with a realistic example to show the code porting process and the use of some available tools. We’ll use the problem from figure 1.9, where a volcanic eruption or earthquake might cause a tsunami to propagate outward. Tsunamis can travel thousands of miles across oceans with just a few feet of height, but when these reach the shore, they can be hundreds of meters high. These types of simulations are usually done after the event because of the time required to set up and run the problem. We’d prefer to simulate it in real time so that we can provide warnings to those who might be affected. Speeding up the simulation by running it on a GPU might provide this capability.
We’ll first walk through the physics that occurs in this scenario and then translate that into equations to numerically simulate the problem. The specific scenario we want to represent is the breaking off of a large mass of an island or other land mass, which falls into the ocean as figure 13.2 illustrates. This event actually happened with Anak Krakatau (“Child of Krakatau”) in December 2018.
Figure 13.2 The tsunami wave that occurred at Anak Krakatau on December 22, 2018, was caused by a sediment slide from the volcanic island.
For the December event, the landslide volume on the west flank of the Krakatau island was about 0.2 cubic km. This was smaller than earlier risk projections estimated. Additionally, wave heights were estimated to be over 100 meters. With the short distance from the source to the shore, there was little warning for those in the area, and with over 400 deaths, the event garnered world-wide news coverage.
Scientists performed many simulations prior to the event and even more afterward. You can view some of the visualizations and an analysis of the event at http://mng.bz/4Mqw. How were the simulations done? The basic physics required is only a small step in complexity from the stencil calculations that we have looked at throughout this book. A full-fledged simulation code might have a lot more sophisticated bells and whistles, but we can go a long way with simple physics. So let’s take a look at the required physics behind the simulations.
The mathematical equations for the tsunami are relatively simple. These are conservation of mass and conservation of momentum. The latter is basically Newton’s first law of motion: “An object at rest stays at rest and an object in motion stays in motion.” The momentum equation uses the second law of motion, “Force is equal to the change in momentum.” For the conservation of the mass equation, we basically have that the change in mass for a computational cell over a small increment in time is equal to the sum of the mass crossing the cell boundaries as shown here:
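Written out, with m the cell mass and u and v the x- and y-velocities, this balance takes the standard conservation form:

```latex
\frac{\partial m}{\partial t} + \frac{\partial (um)}{\partial x} + \frac{\partial (vm)}{\partial y} = 0
```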
where ∂m/∂t is the change in mass relative to time, and um and vm are the mass fluxes (velocity * mass) across the x- and y-faces. Further, because water is incompressible, the density of water can be treated as constant. The mass of a cell is the volume * density. If we have cells that are all 1 meter × 1 meter, the volume is height × 1 meter × 1 meter. Putting this all together, everything is constant except for height, so we can replace mass with the height variable:
Mass = Volume · Density = Height · 1 Meter · 1 Meter · Density = Constant · Height
Also using u = vx and v = vy, we now get the standard form of the conservation law for the shallow water equations:
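In this standard form, the mass (height) equation is:

```latex
\frac{\partial h}{\partial t} + \frac{\partial (hu)}{\partial x} + \frac{\partial (hv)}{\partial y} = 0
```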
The conservation of momentum is similar but with momentum (mom) replacing the mass or height. We only show the x terms to fit the equation on the page like this:
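The x-momentum equation in this form is:

```latex
\frac{\partial (hu)}{\partial t}
  + \frac{\partial \left(hu^2 + \tfrac{1}{2}gh^2\right)}{\partial x}
  + \frac{\partial (huv)}{\partial y} = 0
```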
The additional term of 1/2 gh² is due to the work done on the system by gravity. According to Newton’s second law, the external force creates additional momentum (F = ma). We’ll look at how this term comes about with and without calculus. First, the acceleration in this case is gravity, and it causes a force acting on the column of water as figure 13.3 shows. Each additional meter of water height creates what is known as hydrostatic pressure, resulting in a higher pressure along the whole column of water. With calculus, we would integrate the pressure along the column to get the momentum created. This integration over the elevation (z) from 0 to the wave height (h) would be
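With the constant density divided out and the pressure varying linearly from the surface (z = h) down to the bottom (z = 0), the integral evaluates to the 1/2 gh² term:

```latex
\int_0^h g\,(h - z)\, dz \;=\; g h^2 - \tfrac{1}{2} g h^2 \;=\; \tfrac{1}{2} g h^2
```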
Figure 13.3 The force of gravity on the column of water creates flow and momentum.
Figure 13.4 The hydrostatic pressure caused by the force of gravity is a linear function of depth.
There is also a much simpler derivation. In this case, the pressure is a linear function (figure 13.4). If we look at the height midpoint then apply the pressure difference at the height midpoint to the whole column, we can get the same solution. What we are doing is summing all of the pressure forces under the curve. The mathematical terminology for this is to integrate the function or perform a Riemann sum where you break the area under a curve into columns then add these. But this is all overkill. The area under the curve is a triangle, and we can use the area of a triangle or A = 1/2 bh.
Our resulting set of equations is
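In standard shallow water form, with g the gravitational acceleration:

```latex
\begin{aligned}
\frac{\partial h}{\partial t} + \frac{\partial (hu)}{\partial x}
  + \frac{\partial (hv)}{\partial y} &= 0\\[4pt]
\frac{\partial (hu)}{\partial t}
  + \frac{\partial \left(hu^2 + \tfrac{1}{2}gh^2\right)}{\partial x}
  + \frac{\partial (huv)}{\partial y} &= 0\\[4pt]
\frac{\partial (hv)}{\partial t}
  + \frac{\partial (huv)}{\partial x}
  + \frac{\partial \left(hv^2 + \tfrac{1}{2}gh^2\right)}{\partial y} &= 0
\end{aligned}
```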
If you are observant, you will notice cross-terms of the momentum fluxes: the y-momentum in the x-momentum equation and the x-momentum in the y-momentum equation. In the conservation of x-momentum, the third term has the x-momentum (hu) moving across the y-face with the y-velocity (v). You can describe this as the advection, or flux, of the x-momentum with the velocity in the y-direction across the top and bottom faces of the computational cell. The flux of the x-momentum (hu) across the x-faces with the velocity u is in the second term as hu².
We also see that the newly created momentum is split across the two momentum equations with the new x-momentum in the x-momentum equation and the y-momentum in the y-momentum equation. These equations are then implemented as three stencil operations in our shallow water code, where for simplicity, we use H = h, U = hu, and V = hv. Now we have a simple scientific application that we can use for our demonstrations.
We have one more implementation detail. We use a numerical method that estimates the properties such as mass and momentum at the faces of each cell halfway through the timestep. We then use these estimates to calculate the amount of mass and momentum that moves into the cell during the timestep. This gives us a little more accuracy for the numerical solution.
Congratulations if you have worked your way through this discussion and gained some understanding. Now you have seen how we take the simple laws of physics and create a scientific application from those. You should always strive to understand the underlying physics and numerical method rather than treat the code as a set of loops.
Next, we reach the profiling step for the shallow water application. For this, we created a shallow water application based on the mathematical and physical equations presented in section 13.3. In many ways, the code is just three stencil calculations for the mass and two momentum equations. We have worked with a single, simple stencil equation since chapter 1, and the example code is included in https://github.com/EssentialsofParallelComputing/Chapter13.
In this section, we show you how to run the shallow water code. We’ll use the code to step through a sample workflow for porting your code to the GPU. First, some notes about the platforms:
macOS—NVIDIA warns that CUDA 10.2 may be the last release to support macOS and only supports it up through macOS v10.13. As a result, NVVP is only supported through macOS v10.13. It sort of works with v10.14 but fails completely on v10.15 (Catalina). We suggest using VirtualBox (https://www.virtualbox.org) as a free virtual machine to try out the tools on Mac systems. We have also supplied a Docker container for macOS.
Windows—NVIDIA still supports Microsoft Windows natively, but you can also use VirtualBox or Docker containers on Windows if you prefer.
Linux—A direct installation on most Linux systems should work.
If you have a GPU on your local system, you can use the local workflow. If not, you will probably be running remotely on a compute cluster and transferring the files back for analysis.
If you want to use the graphics, you will need to install some additional packages. On an Ubuntu system, you can do this with the following commands. The first command is for installing OpenGL and freeglut for real-time graphics. The second is for installing ImageMagick® to handle the graphics file output that we can use for graphics stills. The graphics snapshots can also be converted into movies. The README.graphics file in the GitHub directory has more information on the graphics formats and the scripts in the examples that accompany this chapter.
sudo apt-get install libglu1-mesa-dev freeglut3-dev mesa-common-dev -y
sudo apt install cmake imagemagick libmagickwand-dev
We have found that real-time graphics can accelerate code development and debugging, so we included a sample of how to use them in the example code accompanying this chapter. For example, the real-time graphics output uses OpenGL to display the height of the water in the mesh, giving you immediate visual feedback. The real-time graphics code can also be easily extended to respond to keyboard and mouse interactions within the real-time graphics window.
This example is coded with OpenACC, so it is best to use the PGI compiler. A limited subset of the examples works with the GCC compiler due to its still-developing support of OpenACC. Compiling the example code is straightforward. We just use CMake and make.
mkdir build && cd build
cmake ..

To build with the real-time graphics enabled, configure with

cmake -DENABLE_GRAPHICS=1

Set the graphics file format with

export GRAPHICS_TYPE=JPEG
make
If you cannot get the graphics output to work, the program will run fine without it. But if you get it set up correctly, the real-time graphics output from the code displays a graphics window like that shown in figure 13.5. The graphics are updated every 100 iterations. The figure here shows a smaller mesh than the hard-coded size in the sample code. The lines represent the computational cells, with the wave height higher on the left. The wave travels to the right, with the height decreasing as it moves. The wave crosses the computational domain and reflects off the right face. Then it travels back and forth across the mesh. In a real calculation, there would be objects (such as shorelines) in the mesh.
Figure 13.5 Real-time graphics output from the shallow water application. The red stripes on the left indicate the beginning of the wave, where the landslide enters the water. The wave progresses to the right as it crosses the ocean: orange, yellow, green, and blue. If you’re reading this in black and white, the left shaded region corresponds to the red, and the far right shaded region corresponds to the blue. The lines are the outlines of the computational cells.
If you have a system that can run OpenACC, the executables ShallowWater_par1 through ShallowWater_par4 will also be built. You can use these for the profiling exercises that follow.
We described the parallel development cycle back in chapter 2 as
The first step is to profile our application. For most applications, we recommend using a high-level profiler such as the Cachegrind tool we introduced in section 3.3.1. Cachegrind shows the most time-consuming paths through the code and displays the results in an easy-to-interpret visual representation. However, for a simple program like the shallow water application, function-level profilers like Cachegrind are not effective. Cachegrind shows that 100% of the time is spent in the main function, which doesn’t help us much. We need a line-by-line profiler for this particular situation. For this purpose, we draw upon the most well-known profiler on Unix systems—gprof. Later, when we have code that runs on the GPU, we will use the NVIDIA NVVP profiling tool to get the performance statistics. To get started, we just need a simple tool to profile an application running on the CPU.
Now that we have profiled the application and developed a plan, the next step in the parallel development cycle is to begin the implementation of the plan. In this step, we begin the eagerly awaited modification of the code.
The implementation starts with porting the code to the GPU by moving the computation loop. We follow the same procedure used in section 11.2.2 to port the code to the GPU. The computations are moved by inserting the acc parallel loop pragma in front of every loop as shown on line 95 in the following listing.
Listing 13.1 Adding a loop directive
OpenACC/ShallowWater/ShallowWater_par1.c
95 #pragma acc parallel loop
96 for(int j=1;j<=ny;j++){
97 H[j][0]=H[j][1];
98 U[j][0]=-U[j][1];
99 V[j][0]=V[j][1];
100 H[j][nx+1]=H[j][nx];
101 U[j][nx+1]=-U[j][nx];
102 V[j][nx+1]=V[j][nx];
103 }
We also need to replace the pointer swap on line 191 at the end of the loop with a data copy. This is not ideal because it introduces more data movement and is slower than a pointer swap. That being said, doing a pointer swap in OpenACC is tricky because the pointers on the host and device have to be switched simultaneously.
Listing 13.2 Replacing the pointer swap with a copy
OpenACC/ShallowWater/ShallowWater_par1.c
189 // Need to replace swap with copy
190 #pragma acc parallel loop
191 for(int j=1;j<=ny;j++){
192 for(int i=1;i<=nx;i++){
193 H[j][i] = Hnew[j][i];
194 U[j][i] = Unew[j][i];
195 V[j][i] = Vnew[j][i];
196 }
197 }
You will get better feedback on the performance of your application from a visual representation. At each step of the process, we run the NVVP profiling tool to get a graphical output of the performance trace.
Figure 13.6 shows the ability to zoom into specific kernels to better identify performance metrics within certain compute cycles. Specifically, we zoomed into line 95 from listing 13.1 to show individual memory copies.
Figure 13.6 With NVIDIA’s NVVP, you can zoom into specific copies in the timeline view. Here, you can see a zoomed-in view of individual memory copies within each cycle. This lets you see which source lines these occur on, making it easy to refer back to the application.
The next step in porting the code to the GPU is the addition of data movement directives. This allows us to further improve the application performance by eliminating expensive memory copies. In this section, we will show you how it’s done.
The Visual Profiler, NVVP, helps us to see where we need to focus our efforts. Start by looking for the large MemCpy time blocks and eliminating these one-by-one. As you remove the data transfer costs, your code will start to show speedups, recovering the performance lost during the application of compute directives in section 13.4.4.
In listing 13.3, we show an example of the data movement directives that we added. At the start of the data section, we use the acc enter data create directive to start a dynamic data region. The data will then exist on the device until we encounter an acc exit data directive. For each loop, we add the present clause to tell the compiler the data is already on the device. Refer to the chapter 13 example code in the file OpenACC/ShallowWater/ShallowWater_par2.c for all the changes made to control the data movement.
Listing 13.3 Data movement directives
OpenACC/ShallowWater/ShallowWater_par2.c
51 #pragma acc enter data create( \
52    H[:ny+2][:nx+2], U[:ny+2][:nx+2], V[:ny+2][:nx+2], \
53    Hx[:ny][:nx+1], Ux[:ny][:nx+1], Vx[:ny][:nx+1], \
54    Hy[:ny+1][:nx], Uy[:ny+1][:nx], Vy[:ny+1][:nx], \
55    Hnew[:ny+2][:nx+2], Unew[:ny+2][:nx+2], Vnew[:ny+2][:nx+2])
<...>
59 #pragma acc parallel loop present( \
60    H[:ny+2][:nx+2], U[:ny+2][:nx+2], V[:ny+2][:nx+2])
Applying the data movement directives from listing 13.3 and rerunning the profiler gives us the new performance results in figure 13.7, where you can see the reduction of data movement. By reducing the data transfer time, the overall run time of the application is much faster. In a larger application, you should continue looking for other data transfer operations that you can then eliminate to speed up the code even more.
Figure 13.7 This timeline from NVIDIA’s Visual Profiler NVVP shows four iterations of the computation but now with data movement optimizations. What is interesting in this figure is not so much what you can see, but what is not there. The data movement that occurred in the previous figure is sharply reduced or no longer exists.
For further insight, NVVP provides a guided analysis feature (figure 13.8). In this section, we’ll discuss how to use this feature.
You must judge the suggestions from the guided analysis based on your knowledge of your application. In our example, we have few data transfers, so we will not be able to get the memory copy and compute overlap mentioned in the top suggestion, Low Memcpy/Compute Overlap, in figure 13.8. The same is true of most of the other suggestions. For example, for low kernel concurrency, we have only one kernel, so we can’t have concurrency. Though our application is small and may not need these extra optimizations, they are good to note because they can be useful for larger applications.
Figure 13.8 NVVP provides a guided analysis section as well. Here, the user can acquire insight for further optimizations. Note that the highlighted region shows low compute utilization.
Additionally, figure 13.8 shows low compute utilization for our application run. This is not unusual. This low GPU utilization is more indicative of the huge compute power available on the GPU and how much more it can do. To briefly go back to the performance measurements and analysis of our mixbench performance tool (section 9.3.4), we have a bandwidth-limited kernel so we will, at best, use 1-2% of the GPU’s floating-point capability. In light of this, 0.1% compute utilization isn’t so bad.
Another feature of the NVVP tool is an OpenACC Details window that gives the timings for each operation. One of the best ways to use this is by acquiring the before and after timings as figure 13.9 shows. The side-by-side comparisons give you a concrete measurement of improvement from the data movement directives.
Figure 13.9 NVVP’s OpenACC Details window shows information on each OpenACC kernel and the cost of each operation. We can see the cost of the data transfer in the left window for version 1 of the code versus the time for the optimized data motion in version 2 on the right.
With the OpenACC Details window opened, you’ll note that the line numbers move within the profile. If we look at line 166 in the ShallowWater_par1 listing (on the left in figure 13.10), it takes 4.8% of the run time. The breakdown of the operations shows that a lot of that time is due to data transfer costs. The corresponding line of code in the ShallowWater_par2 listing is 181 (on the right in figure 13.10) and has the addition of the present data clause. We can see that the time for line 181 is only 0.81% and that this is largely due to the elimination of the data transfer costs. The compute construct takes about the same time in both cases at 0.16 ms as shown in the line labeled acc_compute_construct just below the highlighted line.
Figure 13.10 Side-by-side code comparison showing that line 166 in version 1 of the ShallowWater code is now line 181, which has the additional present clause.
NVIDIA is replacing their Visual Profiler tools (NVVP and nvprof) with the Nsight™ tool suite. The tool suite is anchored by two integrated development environments (IDEs):
Nsight Visual Studio Edition supports CUDA and OpenCL development in the Microsoft Visual Studio IDE.
Nsight Eclipse Edition adds the CUDA language to the popular open source Eclipse IDE.
Figure 13.11 shows our shallow water application in the Nsight Eclipse Edition development tool.
Figure 13.11 The NVIDIA Nsight Eclipse Edition application is a code development tool. This window in the tool shows the ShallowWater_par1 application.
The Nsight suite of tools also has single function components that can be downloaded by registered NVIDIA developers. These profilers incorporate the functionality from the NVIDIA Visual Profiler and add additional capabilities. The two components are
Nsight Systems, a system-level performance tool, looks at overall data movement and computation.
Nsight Compute, a performance tool, gives a detailed view of GPU kernel performance.
AMD also has code development and performance analysis capabilities in their CodeXL suite of tools. As figure 13.12 shows, the application development tool is a full-featured code workbench. CodeXL also includes a profiling component (in the Profile menu) that helps with optimizing code for the AMD GPU.
Figure 13.12 The CodeXL development tool supports compiling, running, debugging, and profiling.
These new tools from NVIDIA and AMD are still being rolled out. The availability of full-featured tools, including debuggers and profilers, will be a tremendous boost for GPU code development.
As with many profiling and performance measurement tools, the amount of information is initially overwhelming. You should focus on the most important metrics that you can gather from hardware counters and other measurement tools. In recent processors, the number of hardware counters has steadily grown, giving you insight into many aspects of processor performance that were previously hidden. We suggest the following three aspects as the most critical: occupancy, issue efficiency, and memory bandwidth.
The concept of occupancy is often mentioned as the top concern for GPUs. We first discussed this measure in section 10.3. For good GPU performance, we need enough work to keep compute units (CUs) busy. In addition, we need alternate work to cover stalls when workgroups hit memory-load waits (figure 13.13). As a reminder, CUs in OpenCL terminology are called streaming multiprocessors (SMs) in CUDA. The actual achieved occupancy is reported by the measurement counters. If you encounter low occupancy measures, you can modify the workgroup size and resource usage in the kernels to try to improve this factor. A higher occupancy is not always better. The occupancy just needs to be high enough so that there is alternate work for the CUs.
Figure 13.13 GPUs have a lot of compute units (CUs), also called streaming multiprocessors (SMs). We need to create a lot of work to keep the CUs busy, with enough extra work for handling stalls.
Issue efficiency is the measurement of the instructions issued per cycle versus the maximum possible per cycle. To be able to issue an instruction, each CU scheduler must have an eligible wavefront, or warp, ready for execution. An eligible wavefront is an active wavefront that is not stalled. In some sense, issue efficiency is an important result of having high enough occupancy so that there are lots of active wavefronts. The instructions can be floating-point, integer, or memory operations. Poorly written kernels with lots of stalls cause low issue efficiency even if the occupancy is high. There are a variety of reasons for kernels to encounter stalls. There are also counters that can identify particular reasons for stalls. Some of the possibilities are
Execution dependency—Waiting on a previous instruction to complete
Memory throttle—Large number of outstanding memory operations
Bandwidth is an important metric to understand because most applications are bandwidth limited. The best starting point is to look at the bandwidth measure. There are many memory counters available, allowing you to go as deep as you want. Comparing your achieved bandwidth measurements to the theoretical and measured bandwidth performance for your architecture from sections 9.3.1 through 9.3.3 can give you an estimate on how well your application is doing. You can use the memory measurements to determine whether it would be helpful to coalesce memory loads, to store values in the local memory (scratchpad), or to restructure code to reuse data values.
You are on a flight from somewhere to nowhere and just want to get some of your GPU code working. The latest software release doesn’t work on your company-issued laptop. The workaround is to use a container or a virtual machine (VM) to run a different operating system or a different compiler version.
Each of our chapters has an example Dockerfile and instructions for its use. The Dockerfile contains the commands to build a basic OS and then install the software the chapter needs.
A Docker container is useful for dealing with software that does not work on your operating system. For example, for software that only runs on Linux, you can install a container on your Mac or Windows laptop that runs Ubuntu 20.04. Using a container works well for text-based, command-line software.
Containers also limit access to hardware devices such as GPUs. One option is to run the device kernels on the CPU for the GPU languages that have that capability. Doing this, we can at least test our software. If that is not enough for our needs, we can tackle some additional steps to try to get the graphics and GPU computation working. We’ll start by looking at getting the graphics working. Running a graphical interface from a Docker build takes a little more effort.
For chapters that require a GUI for tools or plots, the instructions are a little different. We use the Virtual Network Computing (VNC) software to enable the graphics capabilities through a web interface and VNC client viewers. You must use the docker_run.sh script to start the VNC server, and then you need to start a VNC client on your local system. You can use one of a variety of VNC client packages, or you can connect through some browsers by entering the following in the browser’s address bar:
http://localhost:6080/vnc.html?resize=downscale&autoconnect=1&password=<password>
To test an application with a graphical interface such as NVVP, type nvvp. Or you might want to test the graphics with a simple X Window application such as xclock or xterm. We can also try to get access to the GPUs for computation. Access to the GPUs can be obtained by using the --gpus option or the older --device=/dev/<device name>. The --gpus option is a relatively new addition to Docker and, currently, is only implemented for NVIDIA GPUs.
Most of the chapters have prebuilt Docker containers. You can access the containers for each chapter at https://hub.docker.com/u/essentialsofparallelcomputing. You can retrieve the container for a chapter with the following command:
docker run -p 4000:80 -it --entrypoint /bin/bash essentialsofparallelcomputing/chapter2
There is also a prebuilt Docker container from NVIDIA that you can use as a starting point for your own Docker images. Visit the site at https://github.com/NVIDIA/nvidia-docker for up-to-date instructions. There is another site at NVIDIA with a substantial variety of containers at https://ngc.nvidia.com/catalog/containers. For ROCm, there are extensive instructions on Docker containers at https://github.com/RadeonOpenCompute/ROCm-docker. And Intel has a site for how to set up their oneAPI software in containers at https://github.com/intel/oneapi-containers. Some of their base containers are large and require a good internet connection.
The PGI compiler is important for OpenACC code development and some other GPU code development challenges as well. If you need the PGI compiler for your work, the container site for PGI compilers is at https://ngc.nvidia.com/catalog/containers/hpc:pgi-compilers. As you can see from the sites mentioned here, there are many resources for creating work environments with Docker containers. But this is also a rapidly evolving capability.
Using a virtual machine (VM) allows the user to create a guest OS within their own computer. The normal operating system is called the host, and the VM is called the guest. You can have more than one VM running as a guest. VMs use a more restrictive environment for the guest operating system than exists in the container implementations. Often, it is easier to set up GUIs than it is with containers. Unfortunately, access to the GPU for computation is difficult or impossible. You might find VMs useful for GPU languages that have an option supporting computation on the host CPU.
Let’s look at the process of setting up an Ubuntu guest operating system in VirtualBox. This example sets up the shallow water example running on the CPU with the PGI compiler in VirtualBox with graphics.
Now we are ready to install Ubuntu. The process is the same as setting up an Ubuntu system on your desktop.
There are instructions for setting up virtual machines with the examples for each chapter. For this chapter, log back in and install the chapter examples:
git clone --recursive https://github.com/essentialsofparallelcomputing/Chapter13.git
cd Chapter13 && sh -v README.virtualbox
The commands in the README.virtualbox file install the software and then build and run the shallow water application. The real-time graphics output should also work. You can also try the nvprof utility to profile the shallow water application.
When access to a specific GPU is limited (no supercomputer, laptop or desktop GPU, or remote server), you can make use of cloud computing. Cloud computing refers to servers provided by large data centers. While most of these services are for more general users, some sites catering to HPC-style services are beginning to appear. One of these sites is http://mng.bz/Q2YG. The Fluid Numerics Cloud cluster (fluid-slurm-gcp) setup on the Google Cloud Platform (GCP) has the Slurm batch scheduler and MPI. NVIDIA GPUs can be scheduled as well. Getting started can be a bit complicated. The Fluid Numerics site has some information to help with that process at http://mng.bz/XYwv.
The advantages of having hardware resources available on demand are often compelling. Google Cloud offers a $300 trial credit that should be more than sufficient for exploring the service. There are other cloud providers and add-on services that can provide exactly what you need, or you can customize the environment yourself. Intel has set up a cloud service for testing out Intel GPUs so that developers have access to both software and hardware for their oneAPI initiative and their DPCPP compiler that provides a SYCL implementation. You can try it out by going to https://software.intel.com/en-us/oneapi and registering to use it.
Incorporating a workflow and development environment is especially important for GPU code development. With the great variety of possible hardware configurations, the examples presented in this chapter will likely require some customization for your situation. Indeed, the configuration and setup of development systems is one of the challenges of GPU computing. You may even find that it is easier to use one of the prebuilt Docker containers rather than figure out the process to configure and install software on your system.
We also suggest checking the most recent documentation relevant to your needs from the additional reading suggested in section 13.8.1. The tools and workflows are the fastest changing aspects of GPU programming. While the examples in this chapter will be generally relevant, the details are likely to change. Much of the software is so new that documentation on its use is still being developed.
The NVIDIA installation manual has some information on installing CUDA tools using a package manager at:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-installation
NVIDIA has a couple of resources on their profiling tools and the transition from NVVP to the Nsight tool suite at the following sites:
NVIDIA NSight Guide at https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#nvvp-guide
NVIDIA profiling tool comparison at https://devblogs.nvidia.com/migrating-nvidia-nsight-tools-nvvp-nvprof/
Other tools include the following:
CodeXL has been released as open source under the GPUOpen initiative. AMD has also removed its AMD brand from the tool to promote cross-platform development. For more information, see https://github.com/GPUOpen-Tools/CodeXL.
NVIDIA has a GPU Cloud with resources such as the PGI compilers in a container at https://ngc.nvidia.com/catalog/containers/hpc:pgi-compilers.
AMD also has a webpage on setting up virtualization environments and containers. The virtualization instructions include a passthrough technique to get access to the GPU for computation. You can find this information at http://mng.bz/MgWW.
Run nvprof on the stream triad example. You might try the CUDA version from chapter 12 or the OpenACC version from chapter 11. What workflow did you use for your hardware resources? If you don’t have access to an NVIDIA GPU, can you use another profiling tool?
Generate a trace from nvprof and import it into NVVP. Where is the run time spent? What could you do to optimize it?
Download a prebuilt Docker container from the appropriate vendor for your system. Start up the container and run one of the examples from chapter 11 or 12.
Improving performance is a high priority for scientific and big data applications. Performance tools can help you get the most out of your GPU hardware.
There are many profiling tools available for GPU programming. You should try out the many new and emerging capabilities that are available.
Workflows are essential for efficient GPU code development. Explore what works for you in your environment and the available GPU hardware.
There are workarounds through the use of containers, virtual machines, and cloud computing to handle incompatibilities, computing needs, and access to GPU hardware. These workarounds give access to a large sampling of GPU vendor hardware that might not otherwise be available.
With today’s high performance computing (HPC) systems, it is not enough for you to just learn parallel programming languages. You also need to understand many aspects of the ecosystem including the following:
Requesting and scheduling resources using an HPC batch system
Writing and reading data in parallel on parallel file systems
Making full use of the tools and resources to analyze performance and assist software development
These are just some of the important topics that surround the core parallel programming languages, forming a complementary set of capabilities we call the HPC ecosystem.
Our computing systems are exponentially growing in both complexity and the number of cores. Many of the considerations in HPC are also becoming important for high-end workstations. With so many processor cores, we need to control the placement and scheduling of processes within a node, a practice that is loosely called process affinity and done in conjunction with the OS kernel. As the number of cores on processors grows, the tools for controlling process affinity are quickly being developed to help with new concerns about process placement. We’ll cover some of the techniques that are available for assigning process affinity in chapter 14.
Sophisticated resource management systems have become ubiquitous due to the growth in complexity of computing resources. These “batch systems” form a queue of requests for the resources and allocate these out according to a priority system called a fair-share algorithm. When you first get on an HPC system, the batch system can be confusing. Without knowing how to use a scheduler, you cannot deploy your applications on these large machines. This is why we think it’s essential to go over the basics of using the most common batch systems in chapter 15.
We also don’t just write out files the same way on HPC systems; we write these out in parallel to special filesystem hardware that can stripe the file writes across multiple disks simultaneously. To exploit the power of these parallel filesystems, you need to learn about some of the software used for parallel file operations. In chapter 16, we show you how to use MPI-IO and HDF5, which are two of the more common parallel file software libraries. With data sets growing ever larger, the potential uses of parallel file software are expanding far outside the traditional HPC applications.
Chapter 17 covers a broad range of important tools and resources for the HPC application developer. You might find profilers of great value in helping your application performance. There is a wide range of profilers for different use cases and hardware such as GPUs. There are also tools that help with the software development process. These tools allow you to produce correct, robust applications. Additionally, many application developers can discover specialized approaches for their application from the wide variety of sample applications.
The capabilities of the HPC ecosystem are becoming more important as the complexity and scale of our computing platforms grow. The knowledge of how to use these capabilities has often been neglected. We hope that by covering these often overlooked aspects of high-performance computing in these four chapters, you will be able to get more productive use from your computing hardware.
We first encountered affinity in section 8.6.2 on the MPI (Message Passing Interface), where we defined it and briefly showed how to handle it. We repeat the definition here and also define process placement.
Affinity—Assigns a preference for the scheduling of a process, rank, or thread to a particular hardware component. This is also called pinning or binding.
Placement—Assigns a process or thread to a hardware location.
We’ll go into more depth about affinity, placement, and the order of threads or ranks in this chapter. Concern about affinity is a recent phenomenon. In the past, with just a few processor cores per CPU, there wasn’t that much to gain. As the number of processors grows and the architecture of a compute node gets more complicated, affinity has become more and more important. Still, the gains are relatively modest; perhaps the biggest benefit is in reducing the variation in performance from run to run and getting better on-node scaling. Occasionally, controlling affinity can avoid truly disastrous scheduling decisions by the kernel with respect to the characteristics of your application.
The decision of where to place a process or a thread is handled by the operating system kernel. Kernel scheduling has a rich history and is key to the development of multitasking, multi-user operating systems. It is due to these capabilities that you can fire up a spreadsheet, temporarily switch to a word processor, and then handle an important email. However, the scheduling algorithms developed for the general user are not always suitable for parallel computing. We can launch four processes on a system with four processor cores, but the operating system schedules those four processes any way it wants. It could place all four processes on the same processor, or it could spread them out across the four processors. Generally, the kernel does something reasonable, but it can interrupt one of the parallel processes to perform a system function, causing all the other processes to idle and wait.
In chapter 1, figures 1.20 and 1.21, we showed question marks about where the processes get placed because we had no control over the placement of processes or threads on processors. At least until now. Recent releases of MPI, OpenMP, and batch schedulers have started to offer features to control placement and affinity. Although there has been a lot of change in the options in some of the interfaces, things seem to be settling down with recent releases. However, you are advised to check the documentation for the releases that you use for any differences.
Unlike most common desktop applications, parallel processes need to be scheduled together. This is referred to as gang scheduling.
Definition Gang scheduling is a kernel scheduling algorithm that activates a group of processes at the same time.
Because parallel processes generally synchronize periodically during a run, scheduling a single thread that ends up waiting on another process that is not active has no benefit. The kernel scheduling algorithm has no information that a process is dependent on another’s operation. This is true for MPI, OpenMP threads, and GPU kernels as well. The best approach for getting gang scheduling is to only allocate as many processes as there are processors and bind those processes to the processors. We cannot forget that the kernel and system processes need somewhere to run. Some advanced techniques reserve a processor just for system processes.
It is not enough to keep every parallel process active and scheduled. We also need to keep processes scheduled on the same Non-Uniform Memory Access (NUMA) domain to minimize memory access costs. With OpenMP, we typically go to a lot of trouble to “first touch” data arrays on the processor where the data is used (see section 7.1.1). If the kernel then moves your process to another NUMA domain, your efforts are all for naught. We saw in section 7.3.1 that the penalty for memory access in the wrong NUMA domain can typically be a factor of two or more. It is a top priority for our processes to stay on the same memory domain.
Typically, a NUMA domain is aligned with the sockets on a node. If we can set a process’s affinity to the same socket, we’ll always get the same, optimal access time to main memory. The need for NUMA region affinity, however, is dependent on your CPU architecture. Personal computing systems often have only one NUMA region, while large HPC systems often have far more processing cores per node, with two CPU sockets and two or more NUMA regions.
While tying affinity to a NUMA domain optimizes our access time to main memory, we still can have less than optimal performance due to poor cache usage. A process fills the L1 and L2 cache with the memory that it needs. But then, if it gets swapped out to another processor on the same NUMA domain with a different L1 and L2 cache, cache performance suffers. The caches then need to be filled again. If you reuse data a lot, this causes a performance loss. For MPI, we want to lock processes or ranks to a processor. But with OpenMP, this causes all the threads to be launched on the same processor because the affinity is inherited by the spawned threads. With OpenMP, we want to have affinity for each thread to its processor.
Some processors also have a feature called hyperthreading. Hyperthreads add another layer of complexity to the process placement considerations. First, we need to define hyperthreading.
Definition Hyperthreading, an Intel technology, makes a single processor appear to be two virtual processors to the operating system through sharing of hardware resources between two threads.
Hyperthreads share a single physical core and its cache system. Because the cache is shared, there isn’t as much penalty for movement between hyperthreads. But it also means that each virtual core has half the cache as a real physical core if the processes do not have any data in common. For our memory-bound applications, halving the cache can be a serious blow. Thus, the effectiveness of these virtual cores is mixed. Many HPC systems turn them off because some programs slow down with hyperthreads. Not all hyperthreads are equal either on the hardware or operating system level, so don’t assume that if you didn’t see a benefit on a previous implementation, you won’t on your current system. If we use hyperthreads, we’ll want the process placement to be close by so that the shared cache benefits both virtual processors.
In order to leverage affinity for better performance, we need to know the details of our hardware architecture. The variety of hardware architectures makes this difficult; Intel alone has over a thousand CPU models. In this section, we introduce how to understand your architecture. This is a requirement before you can use affinity to exploit it.
You can get the best view of your architecture with the lstopo utility. We first saw lstopo in section 3.2.1 with the output for a Mac laptop in figure 3.2. The laptop is a simple architecture with four physical processing cores which, with hyperthreading enabled, appear as eight virtual cores to the operating system. We can also see in figure 3.2 that the L1 and L2 caches are private to the physical core, and the L3 cache is shared across all of the processors. We also note that there is just one NUMA domain. Now let’s take a look at a more complicated CPU. Figure 14.1 shows the architecture for an Intel Skylake Gold CPU.
Figure 14.1 The Intel Skylake Gold architecture with two NUMA domains and 88 processing cores reveals the complexity of higher-end compute nodes.
The gray boxes in figure 14.1, each labeled core and containing two light rectangles labeled PU for processing unit, are physical cores. The two boxes inside each of these gray boxes are the virtual processors created by hyperthreading. The L1 and L2 caches are private to each physical processor, while the L3 cache is shared across the NUMA domain. We can also see that the network and other peripherals at the right of the figure are closer to the first NUMA domain. We can get some information on most Linux or Unix systems with the lscpu command (figure 14.2).
Figure 14.2 Output from lscpu command for the Intel Skylake Gold processor.
The output from lscpu confirms that there are two threads per core and two NUMA domains. The processor numbering seems a little odd, but by putting the first 22 processors on the first NUMA node and then skipping to the next 22 processors on the second node, the hyperthreads are left to be numbered last. Remember that the NUMA utilities’ definition of a node differs from ours, in which a node is a separate, distributed-memory system.
So what is the strategy for affinity and process placement for this architecture? Well, it depends on the application. Each application has different scaling and threading performance needs that must be considered. We’ll want to make sure that we keep processes in their NUMA domains to get the optimal bandwidth to main memory.
Thread affinity is vital when optimizing applications with OpenMP. Tying a thread to the location of the memory it uses is important to achieve good memory latency and bandwidth. We go to great effort to do first touch to get memory placed close to the thread as we discussed in section 7.1.1. If the threads are moving around to different processors, we lose all the benefits we should get from our extra effort.
With OpenMP v4.0, the affinity controls for OpenMP were expanded to include the close, spread, and primary keywords, in addition to the existing true or false options. Also added were three options for the OMP_PLACES environment variable: sockets, cores, and threads. In summary, we now have these affinity and placement controls:
OMP_PLACES puts limits on where the threads can be scheduled. There is actually one option that is not listed: the node. It is the default and allows each thread to be scheduled anywhere in the “place.” With more than one thread on the default place of the node, the possibility exists that the scheduler will move the threads or that two or more threads will collide on one virtual processor. One sensible approach is not to have more threads than the quantity of the specified place. Perhaps the better rule is to specify a place that has a quantity greater than the desired number of threads. We’ll show how that works in an example later in this section.
The OMP_PROC_BIND environment variable has five possible settings, but these have some overlap in meaning. The close, spread, and primary settings are specialized versions of true.
Note We also note that primary replaces the deprecated master keyword as of the OpenMP v5.1 standard. You may continue to encounter the old usage as compilers implement the new standard.
With the false setting, the kernel scheduler is free to move threads around. The true setting tells the kernel not to move the thread once it gets scheduled. But it can be scheduled anywhere within the place constraint and can vary from run to run. The primary setting is a special case that schedules threads on the main processor. The close setting schedules the threads close together and spread distributes the threads. The choice of which of these two settings to use has some subtle implications that you will see in the example for this section.
Note You can also set the placement with a detailed list. This is a more advanced use case that we won’t go over here. The detailed list can give more fine-tuned control, but it is less portable to a different CPU type.
The OpenMP environment variables set the affinity and placement for the whole program. You can also set the affinity for individual loops through the addition of a clause on the parallel directive. The clause has this syntax:
proc_bind([primary|close|spread])
The following example shows these affinity controls in operation on our simple vector addition program from section 7.3.1. The affinity-reporting routines can also be added to your code to see the impact there.
In the placement reporting routine, we query the OpenMP settings, report those, then show the placement and affinity for each thread. To try it out, compile the code with the verbose setting and run it with 44 threads or whatever number of threads makes sense on your system, and no special environment variable settings. The example code is at https://github.com/EssentialsofParallelComputing/Chapter14.git in the OpenMP subdirectory.
Let’s see what happens when we place the threads on hardware cores and set the affinity binding to close.
export OMP_PLACES=cores
export OMP_PROC_BIND=close
./vecadd_opt3
The output with these affinity and placement settings is shown in figure 14.3.
Figure 14.3 Affinity and placement report for OMP_PLACES=cores and OMP_PROC_BIND=close. Each thread can run on two possible virtual cores. These two processors belong to a single hardware core due to hyperthreading.
Wow! We can actually control the kernel! The threads are now pinned to the two virtual cores belonging to a single hardware core. The run time of 0.0166 ms is the last number in the output. This run time is a substantial improvement over the 0.0221 ms in the previous run for a 25% reduction in the computation time. You can experiment with various environment variable settings and see how the threads are placed on the node.
We are going to automate the exploration of all the settings and how they scale with different numbers of threads. We’ll turn off the verbose option to reduce the output that we have to deal with. Only the run time will print. Remove the previous build and rebuild the code as follows:
mkdir build && cd build
cmake ..
make
We then run the script in the following listing to get the performance for all cases.
Listing 14.1 Script to automate exploring all settings
OpenMP/run.sh
#!/bin/bash

calc_avg_stddev()    ❶
{
   #echo "Runtime is $1"
   awk '{
      sum = 0.0; sum2 = 0.0        # Initialize to zero
      for (n=1; n <= NF; n++) {    # Process each value on the line
         sum += $n;                # Running sum of values
         sum2 += $n * $n           # Running sum of squares
      }
      print "  Number of trials=" NF ", avg=" sum/NF ", std dev=" sqrt((sum2 - (sum*sum)/NF)/NF);
   }' <<< $1
}

conduct_tests()    ❷
{
   echo ""
   echo -n `printenv |grep OMP_` ${exec_string}
   foo=""
   for index in {1..10}    ❸
   do
      time_result=`${exec_string}`
      time_val[$index]=${time_result}
      foo="$foo ${time_result}"
   done
   calc_avg_stddev "${foo}"
}

exec_string="./vecadd_opt3 "

conduct_tests

THREAD_COUNT="88 44 22 16 8 4 2 1"

for my_thread_count in ${THREAD_COUNT}    ❹
do
   unset OMP_PLACES
   unset OMP_PROC_BIND
   export OMP_NUM_THREADS=${my_thread_count}

   conduct_tests

   PLACES_LIST="threads cores sockets"
   BIND_LIST="true false close spread primary"

   for my_place in ${PLACES_LIST}    ❺
   do
      for my_bind in ${BIND_LIST}    ❻
      do
         export OMP_NUM_THREADS=${my_thread_count}
         export OMP_PLACES=${my_place}
         export OMP_PROC_BIND=${my_bind}

         conduct_tests
      done
   done
done
❶ Calculates average and standard deviation
❷ Runs the timing tests for the current settings
❸ Repeats ten times to get statistics
❹ Loops over number of threads
❺ Loops over placement settings
❻ Loops over affinity settings
Due to space, we show only a few of the results in figure 14.4. All of the values are speedups relative to a single thread with no affinity or placement settings.
Figure 14.4 OpenMP affinity and placement settings of OMP_PROC_BIND=spread boosts the parallel scaling by 50%. The lines are for various numbers of threads for a particular setting and are ordered roughly from high to low in the legend.
The first thing to note from figure 14.4 in our analysis is that the program is generally the fastest for all settings with only 44 threads. Overall, hyperthreading does not help. The exception is the close setting for threads because until we have more than 44 threads with this setting, there are no processes on the second socket. With threads only on the first socket, it limits the total memory bandwidth that can be obtained. At the full 88 threads, the close setting for threads gives the best performance, although by only a little bit. The close setting, in general, shows the same limited memory bandwidth effect due to only having threads on the first socket. You can also see that at larger process counts with process binding, the performance is higher than without process binding.
Some key points to take away from this analysis
Hyperthreading does not help with simple memory-bound kernels, but it also doesn’t hurt.
For memory-bandwidth-limited kernels on multiple sockets (NUMA domains), get both sockets busy.
We don’t show the results for setting OMP_PROC_BIND to primary because it forces all the threads to be on the same processor and slows the program by as much as a factor of two. We also don’t show setting OMP_PLACES to sockets because it has lower performance than those shown.
There are also benefits to applying affinity with MPI applications as discussed in section 14.2. It helps to get full memory bandwidth and cache performance by keeping the processes from being migrated to different processor cores by the operating system kernel. We will discuss affinity with OpenMPI because it has the most publicly available tools for affinity and process placement. Other MPI implementations like MPICH must be compiled with SLURM support enabled, which isn’t as applicable to personal machines. We will discuss the command-line tools that can be used in more general situations in section 14.6. For now, let’s move onward with our exploration of affinity in OpenMPI!
Rather than leaving process placement to the kernel scheduler, OpenMPI specifies a default placement and affinity. The default settings for OpenMPI vary depending on the number of processes. These are
Two or fewer processes—bind to core
More than two processes—bind to socket
Some HPC centers might set other defaults such as always binding to cores. This binding policy may make sense for most MPI jobs but can cause problems with applications using both OpenMP threading and MPI. The threads will all be bound to a single processor, serializing the threads.
Recent versions of OpenMPI have extensive support for process placement and affinity. Using these tools, you usually get a performance gain. The gain depends upon how the process scheduler in the operating system is optimizing placement. Most schedulers are tuned for general computing, such as word processing and spreadsheets, but not parallel applications. Coaxing the scheduler to “do the right thing” potentially yields a benefit of 5-10%, but it can be a lot more.
For most use cases, it is sufficient to use simple controls to place processes and to bind these to hardware components. These controls are supplied to the mpirun command as options. Let’s start with looking at distributing processes equally across a multi-node job. It is easiest to demonstrate this with an example.
For our first run of our application, we simply ask mpirun to launch 32 processes:
mpirun -n 32 ./MPIAffinity | sort -n -k 4
We then have to sort the output by the data in the fourth column because the order of output by processes is random (done by the command sort -n -k 4). The output for this command with our placement report routine is shown in figure 14.5.
Figure 14.5 For mpirun -n 32, all of our processes are on the cn328 node. The affinity is set to the NUMA region (socket).
From the output in figure 14.5, we see that all the ranks were launched on node cn328. Referring to the default affinity settings for OpenMPI at the start of this section, for more than two ranks the affinity is set to bind to the socket. The output from the lscpu command shows our first NUMA region contains the virtual processing cores 0-17, 36-53. NUMA regions are usually aligned with each socket. In our output, we see that the core affinity equals 0-17, 36-53, confirming that the affinity was set to the socket.
Because our real application’s memory requirements are larger than the 128 GiB on the node, it fails when allocating memory. We thus need to find a way to spread out the processes. For this, we add another option, --npernode <#> or -N <#>, which tells MPI how many ranks to put on each node. We need four nodes to get enough memory for our problem, so we want eight processes per node.
mpirun -n 32 --npernode 8 ./MPIAffinity | sort -n -k 4
Figure 14.6 shows our placement report.
Figure 14.6 The MPI processes are spread out across the four nodes, cn328 through 331. The affinity is still tied to the NUMA region.
From the output in figure 14.6, we can see that we are running on four nodes. We should now have enough memory to run our application. Alternatively, we could specify how many ranks per socket with --npersocket. We have two sockets per node, so we want four ranks per socket, thus:
mpirun -n 32 --npersocket 4 ./MPIAffinity | sort -n -k 4
Figure 14.7 shows the output from the placement per socket.
Figure 14.7 With the placement set to four processes per socket, the order of the ranks changes. Now the four adjacent ranks are on the same NUMA region.
The placement report in figure 14.7 shows that the order of the ranks places adjacent ranks on the same NUMA domain instead of alternating the ranks between NUMA domains. That might be better if ranks are communicating with nearest neighbors.
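The two orderings can be sketched as simple index arithmetic (an illustration with hypothetical helper names, not OpenMPI code):

```c
// Default socket placement alternates ranks between sockets round-robin.
int socket_round_robin(int rank, int nsockets)
{
    return rank % nsockets;
}

// With --npersocket, blocks of adjacent ranks stay on the same socket.
int socket_block(int rank, int npersocket, int nsockets)
{
    return (rank / npersocket) % nsockets;
}
```

With two sockets and --npersocket 4, ranks 0-3 land on socket 0 and ranks 4-7 on socket 1, instead of alternating rank by rank.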
So far, we have only worked on the placement of processes. Now let's see what we can do about the affinity and binding of the MPI processes. For this, we add the --bind-to [socket | numa | core | hwthread] option to mpirun:
mpirun -n 32 --npersocket 4 --bind-to core ./MPIAffinity | sort -n -k 4
We can see how this changes the affinity for the processes in the placement report in figure 14.8.
Figure 14.8 The affinity from binding to a core changes the affinity for the processes to a hardware core. Each hardware core represents two virtual cores because of hyperthreading. We get two locations for each process.
The placement results in figure 14.8 show that the process affinity is now restricted more than it was previously. There are two virtual cores that each process can schedule to run on. These two virtual cores belong to one hardware core, thus showing that the core binding option refers to a hardware core. Only four of the 18 processor cores on each socket are used. This is what we want so that there is more memory for each MPI rank. Let’s try binding the process to the hyperthreads instead of to the core by using the hwthread option. This should force the scheduler to place processes on one, and only one, virtual core.
mpirun -n 32 --npersocket 4 --bind-to hwthread ./MPIAffinity | sort -n -k 4
Again, we use our placement report program to visualize the placement with the output shown in figure 14.9.
Figure 14.9 The process placement from the hwthread option limits where the processes can run to only one location.
Our last processor layout finally restricts where each process can run to a single location as shown in figure 14.9. That seems like a good result. But wait. Take a closer look. The first two ranks are placed on the pair of hyperthreads (0 and 36) of a single hardware core. This is not a good idea. That means the two ranks are sharing the cache and hardware components of that hardware core instead of having their own full complement of resources.
The mpirun command in OpenMPI also has a built-in option to report bindings. It is convenient for small problems, but the amount of output for nodes with many processors and MPI ranks is hard to handle. Adding --report-bindings to the mpirun command used for figure 14.9 produces the output shown in figure 14.10.
Figure 14.10 Placement report from the --report-bindings option to mpirun shows where ranks are bound with the letter B.
The visual layout is a little easier to quickly understand, and there is a lot of information packed into the output. Each line indicates a rank in MPI_COMM_WORLD (MCW). The symbols between the forward slashes on the right side indicate the binding location for that process. The set of two dots between the forward slash symbols shows that there are two hyperthreads per core. The two sets of brackets delineate the two sockets on the node.
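A small helper shows how such a binding string can be assembled from a mask (a sketch in the style of the --report-bindings output for a single socket; the function is ours, not OpenMPI's):

```c
// Build a --report-bindings style string for one socket: one group of
// characters per core ('B' = bound hyperthread, '.' = unbound), groups
// separated by '/', and the whole socket wrapped in brackets.
void binding_string(const int *bound, int ncores, int threads_per_core,
                    char *out)
{
    char *p = out;
    *p++ = '[';
    for (int c = 0; c < ncores; c++) {
        if (c > 0) *p++ = '/';
        for (int t = 0; t < threads_per_core; t++)
            *p++ = bound[c * threads_per_core + t] ? 'B' : '.';
    }
    *p++ = ']';
    *p = '\0';
}
```

A process bound to both hyperthreads of core 0 on a three-core socket would render as [BB/../..].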
With the examples we explored in this section, you should be getting an idea of how to control placement and affinity. You should also have some tools to check that you are getting the placement and process bindings you expect.
Now we will explore the full picture of affinity for parallel computing. We will use this as a way of introducing the advanced options offered in OpenMPI for even more control.
The concept of affinity is born out of how the operating system sees things. At the level of the operating system, you can set where each process is allowed to run. On Linux, this is done through either the taskset or the numactl commands. These commands, and similar utilities on other operating systems, emerged as the complexity of the CPU grew so that you could provide more information to the scheduler in the operating system. The directions might be taken as hints or requirements by the scheduler. Using these commands, you can pin a server process to a particular processor to be closer to a particular hardware component or to gain faster response. This focus on affinity alone is enough when dealing with a single process.
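On Linux, the underlying mechanism is the sched_setaffinity/sched_getaffinity pair of system calls, which taskset and numactl wrap. A minimal Linux-specific sketch (ours, not the book's code):

```c
// Linux-specific sketch of setting and checking process affinity.
// _GNU_SOURCE must be defined before including <sched.h> for cpu_set_t.
#define _GNU_SOURCE
#include <sched.h>

// Restrict the calling process to a single logical CPU; returns 0 on success.
int pin_to_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}

// Returns nonzero if the calling process is allowed to run on the given CPU.
int allowed_on_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);
    return CPU_ISSET(cpu, &mask);
}
```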
For parallel programming, there are additional considerations. We have a set of processes that we need to consider. Let's say we have 16 processors and we are running a four-rank MPI job. Where do we put the ranks? Do we put these across the sockets, on all the sockets, pack them close together, or spread them out? Do we place certain ranks next to each other (ranks 1 and 2 together or ranks 1 and 4 together)? To be able to answer these questions, we need to address the following:
Mapping processes to locations in the hardware
Ordering of the ranks across those locations
Binding each process to its assigned location
We’ll go over each in turn, along with how OpenMPI allows you to control these things.
Mapping processes to processors or other locations
When thinking about a parallel application, we have a set of processes and a set of processors. How do we map the processes to the processors? In the example used throughout section 14.4.2, we wanted to spread the processes over four nodes so that every process has more memory than it would if it were on a single node. The more general form for mapping processes in OpenMPI is --map-by hwresource, where the argument hwresource is any of a large number of hardware components. The most common include the following:
--map-by [slot | hwthread | core | socket | numa | node]
With the --map-by option to the mpirun command, the processes are distributed in a round-robin fashion across this hardware resource. The default for the option is socket. Most of these hardware locations are self-explanatory except for slot. Slots are the list of possible locations for processes from the environment, the scheduler, or a host file. This form of the --map-by option is still limited in its meaning and, therefore, its effect.
A more general form uses an option called ppr or processes per resource, where n is the number of processes. Instead of a round-robin mapping by resource, you can specify a block of processes per hardware resource:
--map-by ppr:n:hwresource
--map-by ppr:n:[slot | hwthread | core | socket | numa | node]
In our earlier examples, we used the simpler option of --npernode 8. In this more general form, it would be shorthand for
--map-by ppr:8:node
If the level of control from the previous options to mpirun is not sufficient, you can specify a list of processor numbers to map with the --cpu-list <logical processor numbers> option, where the processor numbers correspond to the listing from lstopo or lscpu. This option also binds the processes to the logical (virtual) processors at the same time.
Another thing you might want to control is the ordering of your MPI ranks. You may want adjacent MPI ranks to be close to each other in physical processor space if they communicate a lot with each other. This reduces the cost of the communication between these ranks. Usually, it is sufficient to control this with the block size of the distribution during mapping, but you can get additional control with the --rank-by option:
--rank-by ppr:n:[slot | hwthread | core | socket | numa | node]
An even more general option is to use a rank file:
--rankfile <filename>
While you can fine-tune the placement of your MPI ranks with these commands and perhaps gain a couple of percent in performance, it is difficult to come up with the optimum formula.
Binding processes to hardware components
The last piece to control is affinity itself: binding a process to a hardware resource. The option is similar to the previous ones:
--bind-to [slot | hwthread | core | socket | numa | node]
The default setting of core is sufficient for most MPI applications (without the --bind-to option, the default is socket for more than two processes, as mentioned in section 14.4.1). But there are cases where that affinity setting causes problems.
As we saw in the example for figure 14.8, the affinity is set to the two hyperthreads on the hardware core. We might want to try --map-by core --bind-to hwthread to distribute the processes across the cores but bind each process more tightly to a single hyperthread. The performance difference from such fine-tuning is probably small. The greater problem comes when we try to implement a hybrid MPI and OpenMP application. It is important to realize that child processes inherit the affinity settings of their parent. If we use the options --npersocket 4 --bind-to core and then launch two threads, we have two locations for the threads to run (two hyperthreads per core), so we are OK. If we launch four threads, these will share only two logical processor locations, and performance will be limited.
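The inheritance rule can be demonstrated directly with fork on Linux (a sketch of ours, not from the book; hybrid runtimes spawn threads rather than fork, but the inherited mask behaves the same way):

```c
// Demonstrates that a child process inherits its parent's affinity mask.
// Linux-specific sketch; _GNU_SOURCE is needed for cpu_set_t and CPU_COUNT.
#define _GNU_SOURCE
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>

// Pin the calling process to CPU `cpu`, fork a child, and return 1 if the
// child's inherited mask contains only that CPU; -1 if pinning failed.
int child_inherits_affinity(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) return -1;

    pid_t pid = fork();
    if (pid == 0) {                 /* child: check the inherited mask */
        cpu_set_t child_mask;
        CPU_ZERO(&child_mask);
        sched_getaffinity(0, sizeof(child_mask), &child_mask);
        _exit(CPU_ISSET(cpu, &child_mask) && CPU_COUNT(&child_mask) == 1);
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WEXITSTATUS(status);
}
```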
We saw earlier in this section that there are a lot of options for controlling process placement and affinity. Indeed, there are too many combinations to fully explore as we did in section 14.3 for OpenMP. In most cases, we should be satisfied with getting reasonable settings that reflect the needs of our applications.
Our goal in this section is to understand how to set affinity for hybrid MPI and OpenMP applications. Getting affinity right for these hybrid situations can be tricky. For this exploration, we've created a hybrid stream triad example with MPI and OpenMP. We have also modified the placement report used throughout this chapter to output information for hybrid MPI and OpenMP applications. The following listing shows the modified subroutine, place_report_mpi_omp.c.
Listing 14.2 MPI and OpenMP placement reporting tool hybrid stream triad
StreamTriad/place_report_mpi_omp.c
41 void place_report_mpi_omp(void)
42 {
43 int rank;
44 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
45
46 int socket_global[144];
47 char clbuf_global[144][7 * CPU_SETSIZE];
48
49 #pragma omp parallel
50 {
51 if (omp_get_thread_num() == 0 && rank == 0){
52 printf("Running with %d thread(s)\n",omp_get_num_threads());
53 int bind_policy = omp_get_proc_bind();
54 switch (bind_policy)
55 {
56 case omp_proc_bind_false:
57 printf(" proc_bind is false\n");
58 break;
59 case omp_proc_bind_true:
60 printf(" proc_bind is true\n");
61 break;
62 case omp_proc_bind_master:
63 printf(" proc_bind is master\n");
64 break;
65 case omp_proc_bind_close:
66 printf(" proc_bind is close\n");
67 break;
68 case omp_proc_bind_spread:
69 printf(" proc_bind is spread\n");
70 }
71 printf(" proc_num_places is %d\n",omp_get_num_places());
72 }
73
74 int thread = omp_get_thread_num();
75 cpu_set_t coremask;
76 char clbuf[7 * CPU_SETSIZE], hnbuf[64];
77 memset(clbuf, 0, sizeof(clbuf));
78 memset(hnbuf, 0, sizeof(hnbuf));
79 gethostname(hnbuf, sizeof(hnbuf));
80 sched_getaffinity(0, sizeof(coremask), &coremask);
81 cpuset_to_cstr(&coremask, clbuf);
82 strcpy(clbuf_global[thread],clbuf);
83 socket_global[omp_get_thread_num()] = omp_get_place_num();
84 #pragma omp barrier
85 #pragma omp master
86 for (int i=0; i<omp_get_num_threads(); i++){
87       printf("Hello from rank %02d, thread %02d, on %s."         ❶
88              " (core affinity = %2s) OpenMP socket is %2d\n",    ❶
89              rank, i, hnbuf, clbuf_global[i], socket_global[i]); ❶
90 }
91 }
92 }
❶ Merges OpenMP and the MPI affinity report
We start this example by compiling the stream triad application. The stream triad code is at https://github.com/EssentialsofParallelComputing/Chapter14 in the StreamTriad directory. Compile the code with
mkdir build && cd build
cmake -DCMAKE_VERBOSE=1 ..
make
We ran this code on our Skylake Gold processor with 44 hardware processors and two hyperthreads each. We placed the two OpenMP threads on the hyperthreads and then an MPI rank on each hardware core. The following commands accomplish this layout:
export OMP_NUM_THREADS=2
mpirun -n 44 --map-by socket ./StreamTriad
The stream triad code has a call to our placement report from listing 14.2. Figure 14.11 shows the output.
Figure 14.11 The MPI ranks are placed in a round-robin fashion across the sockets with two slots to accommodate the two OpenMP threads. The placement is restricted to a NUMA domain to keep memory close to the threads. The processes are not bound tightly to any particular virtual core, and the scheduler can move these around freely within the NUMA domain.
As the output in figure 14.11 shows, we succeeded in getting the ranks distributed across the NUMA domains in a round-robin manner, keeping the two threads together. This should give us good bandwidth from main memory. The affinity constraints are only sufficient to keep the processes within the NUMA domain, letting the scheduler move the processes around as it wishes. The scheduler can place thread 0 on any of 44 different virtual processors, including 0-21 or 44-65. The numbering can be confusing; 0 and 44 are two hyperthreads on the same physical core.
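On this node, the rule behind the confusing numbers is simple (a machine-specific assumption based on our lscpu output, written as a helper of ours):

```c
// On our 44-core, two-hyperthread-per-core node, virtual CPUs n and n + 44
// are the two hyperthreads of physical core n.
int physical_core(int virtual_cpu)
{
    return virtual_cpu % 44;
}
```

So virtual CPUs 0 and 44 share a physical core, as do 21 and 65.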
Now let's try to obtain more affinity constraints. For this, we need to use the form --map-by ppr:N:socket:PE=N. This command gives us the ability to spread out the processes with a specified spacing and to specify how many MPI ranks to place on each socket. It is hard to unbundle the complexity of the option.
Let’s start with the ppr:N:socket part. We want half of our MPI ranks on each socket. This should be 22 MPI ranks per socket or ppr:22:socket. The last part determines how many processors we want between the placement of processes. We want two threads for each MPI rank, so we want two virtual processors in each block. The specification is for hardware cores. It is important to know that each hardware core contains two virtual processors. Therefore, you only need one hardware core (PE=1). We then pin the threads to a hardware thread. For rank 0, we should get the first hardware core with the virtual processors 0 and 44. That gives us the following commands:
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=true
mpirun -n 44 --map-by ppr:22:socket:PE=1 ./StreamTriad
Whew! That was complicated. Did we get it right? Well, let’s check the output from the command as shown in figure 14.12.
Figure 14.12 The process and thread affinity are now constrained to a logical core, and the two OpenMP threads per rank are located on the hyperthread pairs (0 and 44 in the figure). The ranks are packed close in order to reduce communication costs for more complicated programs. The MPI ranks are pinned to hardware cores and the thread affinity is to the hyperthread.
From the output in figure 14.12, we have the threads locked down where we want them. We also have the MPI ranks pinned to the hardware cores. You can verify this by unsetting the OMP_PROC_BIND environment variable (unset OMP_PROC_BIND); the output (figure 14.13) confirms that the rank is bound to the two logical processors composing a single hardware core.
Figure 14.13 Output without OMP_PROC_BIND=true shows that the MPI ranks are pinned to hardware cores.
We’ve worked through one case and were able to get the affinity settings the way we wanted. But now you want to know if we can run more than two OpenMP threads and how the program performs. Let’s take a look at a set of commands that test any number of threads that divides into the number of processors evenly. The following listing shows the key scripting commands.
Listing 14.3 Setting affinity for hybrid MPI and OpenMP
Extracted from StreamTriad/run.sh
 1 #!/bin/sh
 2 LOGICAL_PES_AVAILABLE=`lscpu | grep '^CPU(s):' | cut -d':' -f 2`          ❶
 3 SOCKETS_AVAILABLE=`lscpu | grep '^Socket(s):' | cut -d':' -f 2`           ❶
 4 THREADS_PER_CORE=`lscpu | grep '^Thread(s) per core:' | cut -d':' -f 2`   ❶
 5 POST_PROCESS="|& grep -e Average -e mpirun |sort -n -k 4"
 6 THREAD_LIST_FULL="2 4 11 22 44"
 7 THREAD_LIST_SHORT="2 11 22"
 8
 9 unset OMP_PLACES
10 unset OMP_CPU_BIND
11 unset OMP_NUM_THREADS
12 < ... basic tests not shown ... >
21
22 export OMP_PROC_BIND=true                                                 ❷
   < ... first loop block not shown ... >
37 for num_threads in ${THREAD_LIST_FULL}
38 do
39    export OMP_NUM_THREADS=${num_threads}                                  ❷
40
41    HW_PES_PER_PROCESS=$((${OMP_NUM_THREADS}/${THREADS_PER_CORE}))         ❸
42    MPI_RANKS=$((${LOGICAL_PES_AVAILABLE}/${OMP_NUM_THREADS}))             ❸
43    PES_PER_SOCKET=$((${MPI_RANKS}/${SOCKETS_AVAILABLE}))                  ❸
44
45    RUN_STRING="mpirun -n ${MPI_RANKS} \                                   ❹
         --map-by ppr:${PES_PER_SOCKET}:socket:PE=${HW_PES_PER_PROCESS} \
         ./StreamTriad ${POST_PROCESS}"
46    echo ${RUN_STRING}
47    eval ${RUN_STRING}
48 done
   < ... additional loop blocks ... >
❶ Gets hardware characteristics
❷ Sets OMP environment variables
❸ Calculates the variables needed for the mpirun command
❹ Builds and launches the mpirun command
To make the script portable, we grab the hardware characteristics using the lscpu command. We then set the desired OpenMP environment parameters. We could set OMP_PROC_BIND to true, close, or spread with the same result for this case, where all the slots are filled. Then we calculate the variables needed for the mpirun command and launch the job.
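The arithmetic the script performs can be written out as plain functions, so the mpirun arguments can be checked by hand (this assumes thread counts that divide the processor counts evenly, as in the thread lists above):

```c
// The variable calculations from the run script as plain functions.
// On the 88-logical-processor, two-socket, two-hyperthread node:
// 2 threads -> 44 ranks, 22 per socket, PE=1.
int mpi_ranks(int logical_pes, int omp_threads)
{
    return logical_pes / omp_threads;
}

int pes_per_socket(int nranks, int sockets)
{
    return nranks / sockets;
}

int hw_pes_per_process(int omp_threads, int threads_per_core)
{
    return omp_threads / threads_per_core;
}
```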
In the full stream triad example in listing 14.2, we tested a combination of thread sizes and MPI ranks that divide evenly into 88 processes. We followed that with 44 total processes, where we skip the hyperthreads because we didn't really get any better performance with them (section 14.3). The performance results are pretty constant over the set of tests. That is because all that is being measured is the bandwidth from main memory. There is little work being done and no MPI communication. The benefits of hybrid MPI and OpenMP are limited in this situation. Where we would expect to see benefits is in much larger simulations where substituting an OpenMP thread for an MPI rank would
Create larger domains that consolidate and reduce ghost cell regions
Reduce contention for processors on a node for a single network interface
Access vector units and other processor components that are not fully utilized
There are also general ways to control affinity from the command line. The command-line tools can help in situations where your MPI or special parallel application doesn't have built-in options to control affinity. These tools can also help with general-purpose applications by binding them close to important hardware components such as graphics cards, network ports, and storage devices. In this section, we cover two command-line tool suites: hwloc and likwid. These tools are developed with high-performance computing in mind.
The hwloc project was developed by INRIA, the French National Institute for Research in Computer Science and Automation. A subproject of the OpenMPI project, hwloc implements the OpenMPI placement and affinity capabilities that we saw in sections 14.4 and 14.5. The hwloc package is also a standalone package with command-line tools. Because there are many hwloc tools, as an introduction, we’ll just look at a couple of these. We’ll use hwloc-calc to get a list of hardware cores and hwloc-bind to bind these.
Using hwloc-bind is simple. Just prefix the application with hwloc-bind and then add the hardware location where you want it to bind. For our application, we’ll use the lstopo command. The lstopo command is also part of the hwloc tools. Here is our one-liner to launch the job on all the hardware cores and bind the processes to the cores:
for core in `hwloc-calc --intersect core --sep " " all`; do hwloc-bind \
    core:${core} lstopo --no-io --pid 0 & done
The --intersect core option uses only hardware cores. The --sep " " option says to separate the numbers in the output with spaces instead of commas. The result of this command on our usual Skylake Gold processor launches 44 lstopo graphic windows, each looking similar to that in figure 14.14. Each window has the bound locations highlighted in green.
Figure 14.14 The lstopo image shows the bound location in green (shaded core) at the lower left. This shows that process 22 is bound to the 22nd and 66th virtual cores, which are hyperthreads for a single physical core.
We could use a similar command to launch two processes on the first core of each socket. For example
for socket in `hwloc-calc --intersect socket --sep " " all`; do hwloc-bind \
    socket:${socket}.core:0 lstopo --no-io --pid 0 & done
The following listing shows how we can build a general-purpose mpirun command with binding.
Listing 14.4 Using hwloc-bind to bind processes
MPI/mpirun_distrib.sh
 1 #!/bin/sh
 2 PROC_LIST=$1
 3 EXEC_NAME=$2
 4 OUTPUT="mpirun "                                                  ❶
 5 for core in ${PROC_LIST}
 6 do
 7    OUTPUT="$OUTPUT -np 1 hwloc-bind core:${core} ${EXEC_NAME} :"  ❷
 8 done
 9 OUTPUT=`echo ${OUTPUT} | sed -e 's/:$/\n/'`                       ❸
10 eval ${OUTPUT}
❶ Initializes this string with mpirun
❷ Appends another MPI rank launch with binding
❸ Strips last colon and substitutes a new line
Now we can launch our MPI affinity application from section 14.4 on the first core of each socket with this command:
./mpirun_distrib.sh "1 22" ./MPIAffinity
This mpirun_distrib script builds the following command and executes it:
mpirun -np 1 hwloc-bind core:1 ./MPIAffinity : -np 1 hwloc-bind core:22 ./MPIAffinity
The likwid-pin tool is one of the many great tools from the likwid (“Like I Knew What I’m Doing”) team at the University of Erlangen. We saw our first likwid tool, likwid-perfctr in section 3.3.1. The likwid tools in this section are command-line tools to set affinity. We’ll look at variants of the tool for OpenMP threads, MPI, and hybrid MPI plus OpenMP applications. The basic syntax for selecting processor sets in likwid uses these options:
To set the affinity, use this syntax: -c <N,S,C,M>:[n1,n2,n3-n4]. To get a list of the numbering schemes, use the command likwid-pin -p. Understanding how likwid-pin works is best gained from examples and experimentation.
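As a simplified sketch of the index-list part of that syntax, the following expands a list such as "0-2,5" into individual ids (our code, not likwid's; the N/S/C/M domain prefixes and the @ concatenation are not handled):

```c
#include <stdlib.h>

// Expand a likwid-style index list such as "0-2,5" (the [n1,n2,n3-n4] part
// of -c) into an array of ids. Returns the number of ids written.
int expand_index_list(const char *list, int *ids, int max_ids)
{
    int count = 0;
    const char *p = list;
    while (*p && count < max_ids) {
        char *end;
        long first = strtol(p, &end, 10);
        long last = first;
        if (*end == '-')                     /* a range like "3-7" */
            last = strtol(end + 1, &end, 10);
        for (long i = first; i <= last && count < max_ids; i++)
            ids[count++] = (int)i;
        p = (*end == ',') ? end + 1 : end;   /* skip the separator */
    }
    return count;
}
```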
Pinning OpenMP threads with likwid-pin
This example shows how to use likwid-pin with OpenMP applications:
export OMP_NUM_THREADS=44
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
./vecadd_opt3
To get this same pinning result with likwid-pin for OpenMP applications, we use the socket (S) option. In the following, we distribute 22 threads on each socket, where the two pin sets are separated and concatenated with the @ symbol:
likwid-pin -c S0:0-21@S1:0-21 ./vecadd_opt3
The OMP environment variables are not necessary when using likwid-pin and are mostly ignored. The number of threads is determined from the pin set lists. For this command, it is 44. We ran the vecadd example from section 14.3, configured with the -DCMAKE_VERBOSE option to get our placement report as figure 14.15 shows.
Figure 14.15 The likwid-pin output is at the top of the screen, followed by our placement report output. The output shows that the threads are pinned to the 44 physical cores.
Our placement report shows that the OMP environment variables are not set and that OpenMP has not placed and pinned the threads in the OpenMP sockets. And yet, we get the same placement and pinning from the likwid-pin tool with the same performance results. We have just confirmed that the OMP environment variables are not necessary with likwid-pin as we claimed in the previous paragraph. One thing to note is that if you set the OMP_NUM_THREADS environment variable to something other than the number of threads in the pin sets, the likwid tool distributes the threads from the OMP_NUM_THREADS variable across the processors specified in the pin sets. When there are more threads than processors, the tool wraps the thread placement around on the available processors.
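The wrap-around behavior can be modeled as a simple modulo mapping. The following sketch is a hypothetical model of that placement, not likwid's actual implementation: thread i lands on the (i mod len(pin_set))-th processor in the pin set.

```python
# Hypothetical model of likwid-pin's wrap-around placement: when there are
# more threads than processors in the pin set, placement wraps around.
def wrap_placement(num_threads, pin_set):
    """Map each thread ID to a processor, wrapping around the pin set."""
    return {t: pin_set[t % len(pin_set)] for t in range(num_threads)}

# Pin set with 4 processors but 6 threads: threads 4 and 5 wrap around.
placement = wrap_placement(6, [0, 1, 2, 3])
print(placement)  # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1}
```

When the thread count matches the pin set size (44 threads on 44 listed processors, as in the example above), each thread gets its own processor and no wrapping occurs.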
Pinning MPI ranks with likwid-mpirun
The likwid pinning functionality for MPI applications is included in the likwid-mpirun tool. You can use this tool as a substitute for mpirun in most MPI implementations. Let’s look at the MPIAffinity example from section 14.4.
Figure 14.16 shows the output from our placement report for this example.
Figure 14.16 The placement report for likwid-mpirun shows that each rank is pinned to cores in numeric order.
That was easy! As figure 14.16 shows, likwid-mpirun pins the ranks to the hardware cores. Let’s move on to an example where we have to provide some options to the command.
What if the user didn’t need to worry about affinity? It is challenging to get users to use the complicated invocations to properly place and pin processes. It might make more sense in many cases to embed the pinning logic into the executable. One way to do this would be to query information about the hardware and set the affinity appropriately. Few applications have yet undertaken this approach, but we expect to see more that do in the future.
Some applications not only set their affinity at run time but also modify the affinity to adapt to changing characteristics during run time! This innovative technique was developed by Sam Gutiérrez of Los Alamos National Laboratory in his QUO library. Perhaps you have an application that uses all MPI ranks on a node, but it calls a library that uses a combination of MPI ranks and OpenMP threads. The QUO library provides a simple interface built on top of hwloc to set proper affinities. It can then push the settings onto a stack, quiesce the processors, and set a new binding policy. We’ll look at examples of initiating process binding within your application and changing it during run time in the following sections.
Setting your process placement and affinities in your application means that you no longer have to deal with complicated mpirun commands or portability between MPI implementations. Here we use the QUO library to implement this binding to all the cores on a Skylake Gold processor. The open source QUO library is available at https://github.com/LANL/libquo.git. First, we build the executable in the Quo directory and run the application with the number of hardware cores on your system:
make autobind
mpirun -n 44 ./autobind
The source code for autobind is shown in listing 14.5. Our placement reporting routine is called before and after the binding calls to show the process bindings.
Listing 14.5 Using QUO to bind processes from your executable
Quo/autobind.c
31 int main(int argc, char **argv)
32 {
33 int ncores, nnoderanks, noderank, rank, nranks;
34 int work_member = 0, max_members_per_res = 2, nres = 0;
35 QUO_context qcontext;
36
37 MPI_Init(&argc, &argv);
38 QUO_create(&qcontext, MPI_COMM_WORLD); ❶
39 MPI_Comm_size(MPI_COMM_WORLD, &nranks);
40 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
41 QUO_id(qcontext, &noderank); ❷
42 QUO_nqids(qcontext, &nnoderanks); ❷
43 QUO_ncores(qcontext, &ncores); ❷
44
45 QUO_obj_type_t tres = QUO_OBJ_NUMANODE; ❷
46 QUO_nnumanodes(qcontext, &nres); ❷
47 if (nres == 0) { ❷
48 QUO_nsockets(qcontext, &nres); ❷
49 tres = QUO_OBJ_SOCKET; ❷
50 } ❷
51
52 if ( check_errors(ncores, nnoderanks, noderank, nranks, nres) )
53 return(-1);
54
55 if (rank == 0)
56 printf("\nDefault binding for MPI processes\n\n");
57 place_report_mpi(); ❸
58
59 SyncIt();
60 QUO_bind_push(qcontext, ❹
QUO_BIND_PUSH_PROVIDED, ❹
61 QUO_OBJ_CORE, noderank); ❹
62 SyncIt();
63
64 QUO_auto_distrib(qcontext, tres, ❺
max_members_per_res, ❺
65 &work_member); ❺
66 if (rank == 0)
67 printf("\nProcesses should be pinned to the hw cores\n\n");
68 place_report_mpi(); ❻
69
70 SyncIt();
71 QUO_bind_pop(qcontext); ❼
72 SyncIt();
73
74 QUO_free(qcontext);
75 MPI_Finalize();
76 return(0);
77 }
❺ Distributes and binds MPI ranks
❻ Reports the process affinities after binding
❼ Pops off the bindings and returns to initial settings
We need to be careful to synchronize processes as we change the bindings. To ensure that, in the following listing, we use an MPI barrier and a micro sleep call in the SyncIt routine.
Listing 14.6 SyncIt subroutine
Quo/autobind.c
23 void SyncIt(void)
24 {
25 int rank;
26 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
27 MPI_Barrier(MPI_COMM_WORLD); ❶
28 usleep(rank * 1000); ❷
29 }
❶ Ensures all ranks reach this point before proceeding
❷ Staggers each rank's output so the reports don't interleave
The output from the autobind application (figure 14.17) clearly shows the bindings changed from sockets to the hardware cores.
Figure 14.17 The output from the autobind demo shows cores initially bound to sockets, but afterwards, these are bound to hardware cores.
Suppose we have an application with one part that wants to use all MPI ranks and another part that works best with OpenMP threads. To handle this, we need to switch the affinities during run time. This is the scenario that QUO is designed for! The steps for this include creating the QUO context, binding each MPI rank to a hardware core, expanding the affinity to the whole node for the OpenMP region, and then popping back to the MPI bindings.
Let’s see how this is done with QUO in the following listing.
Listing 14.7 Dynamic affinity demo switching from MPI to OpenMP
Quo/dynaffinity.c
45 int main(int argc, char **argv)
46 {
47 int rank, noderank, nnoderanks;
48 int work_member = 0, max_members_per_res = 44;
49 QUO_context qcontext;
50
51 MPI_Init(&argc, &argv);
52 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
53 QUO_create(&qcontext, MPI_COMM_WORLD); ❶
54
55 node_info_report(qcontext, &noderank, &nnoderanks);
56
57 SyncIt();
58 QUO_bind_push(qcontext, ❷
QUO_BIND_PUSH_PROVIDED, ❷
59 QUO_OBJ_CORE, noderank); ❷
60 SyncIt();
61
62 QUO_auto_distrib(qcontext, QUO_OBJ_SOCKET, ❸
max_members_per_res, ❸
63 &work_member); ❸
64
65 place_report_mpi_quo(qcontext); ❹
66
67 /* change binding policies to accommodate OMP threads on node 0 */
68 bool on_rank_0s_node = rank < nnoderanks;
69 if (on_rank_0s_node) {
70 if (rank == 0) {
71 printf("\nEntering OMP region...\n\n");
72 // expands the caller's cpuset
// to all available resources on the node.
73 QUO_bind_push(qcontext, ❺
QUO_BIND_PUSH_OBJ, ❺
QUO_OBJ_SOCKET, -1); ❺
74 report_bindings(qcontext, rank); ❻
75 /* do the OpenMP calculation */
76 place_report_mpi_omp(); ❼
77 /* revert to old binding policy */
78 QUO_bind_pop(qcontext); ❽
79 }
80 /* QUO_barrier because it's cheaper than
MPI_Barrier on a node. */
81 QUO_barrier(qcontext);
82 }
83 SyncIt();
84
85 // Wrap-up
86 QUO_free(qcontext);
87 MPI_Finalize();
88 return(0);
89 }
❶ Creates the QUO context
❷ Sets affinities to hardware cores
❸ Distributes and binds MPI ranks
❹ Reports process affinities for all MPI regions
❺ Sets affinity to whole system
❻ Reports CPU masks for OpenMP region
❼ Reports process affinities for OpenMP region
❽ Pops off bindings and returns to MPI bindings
We can run the dynaffinity application with the number of hardware cores on our system with
make dynaffinity
mpirun -n 44 ./dynaffinity
We again use our reporting routines to check the process bindings for the MPI region and for OpenMP. Figure 14.18 displays the output.
Figure 14.18 For the MPI region, the processes are bound to the hardware cores. When we enter the OpenMP region, the affinities are expanded to the whole node.
The output in figure 14.18 shows that the process bindings changed between the MPI and the OpenMP regions, accomplishing a dynamic modification of the affinities during run time.
The handling of process placement and bindings is relatively new. Watch for presentations in the MPI and OpenMP communities for additional developments in this area. In the next section, we list some of the most current materials on affinity that we recommend for additional reading. We’ll follow the additional reading with some exercises to explore the topic further.
The process placement reporting programs used in this chapter for OpenMP, MPI, and MPI plus OpenMP are modified from the xthi.c program used in training for several HPC sites. Here are references to papers and presentations that use it to explore affinities:
Y. He, B. Cook, et al., “Preparing NERSC users for Cori, a Cray XC40 system with Intel many integrated cores” In Concurrency Computat: Pract Exper., 2018; 30:e4291 (https://doi.org/10.1002/cpe.4291).
Argonne National Laboratory, “Affinity on Theta,” at https://www.alcf.anl.gov/support-center/theta/affinity-theta.
National Energy Research Scientific Computing Center (NERSC), “Process and Thread Affinity,” at https://docs.nersc.gov/jobs/affinity/.
Here’s a good presentation on OpenMP that includes a discussion on affinity and how to handle it:
T. Mattson and H. He, “OpenMP: Beyond the common core,” at http://mng.bz/aK47.
We only covered part of the options for the mpirun command in OpenMPI. For exploring more capabilities, see the man page for OpenMPI:
https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php.
Portable Hardware Locality (hwloc) is a subproject of The Open MPI Project. It is a standalone package that works equally well with either OpenMPI or MPICH and has become the universal hardware interface for most MPI implementations and many other parallel programming software applications. For further information, see the following references:
The hwloc project main page https://www.open-mpi.org/projects/hwloc/, where you’ll also find some key presentations.
B. Goglin, “Understanding and managing hardware affinities with Hardware Locality (hwloc),” High Performance and Embedded Architecture and Compilation (HiPEAC, 2013), http://mng.bz/gxYV.
The “Like I Knew What I’m Doing” (likwid) suite of tools is well regarded for its simplicity, usability, and good documentation. Here is a good starting point to investigate these tools further:
University of Erlangen-Nuremberg’s performance monitoring and benchmarking suite, https://github.com/RRZE-HPC/likwid/wiki.
This conference presentation about the QUO library gives a more complete overview and the philosophy behind it:
S. Gutiérrez et al., “Accommodating Thread-Level Heterogeneity in Coupled Parallel Applications,” https://github.com/lanl/libquo/blob/master/docs/slides/gutierrez-ipdps17.pdf, 2017 International Parallel and Distributed Processing Symposium (IPDPS17).
Generate a visual image of a couple of different hardware architectures. Discover the hardware characteristics for these devices.
For your hardware, run the test suite using the script in listing 14.1. What did you discover about how to best use your system?
Change the program used in the vector addition (vecadd_opt3.c) example in section 14.3 to include more floating-point operations. Take the kernel and change the operations in the loop to the Pythagorean formula:
c[i] = sqrt(a[i]*a[i] + b[i]*b[i]);
How do your results and conclusions about the best placement and bindings change? Do you see any benefit from hyperthreads now (if you have those)?
For the MPI example in section 14.4, include the vector add kernel and generate a scaling graph for the kernel. Then replace the kernel with the Pythagorean formula used in exercise 3.
Combine the vector add and Pythagorean formula in the following routine (either in a single loop or two separate loops) to get more data reuse:
c[i] = a[i] + b[i]; d[i] = sqrt(a[i]*a[i] + b[i]*b[i]);
How does this change the results of the placement and binding study?
Add code to set the placement and affinity within an application from one of the previous exercises.
There are tools that show your process placement. These tools can also show you the affinity for your processes.
Use process placement for your parallel applications. This gives you full main memory bandwidth for your application.
Select a good process ordering for your OpenMP threads or MPI ranks. A good ordering reduces communication costs between processes.
Use a binding policy for your parallel processes. Binding each process keeps the kernel from moving your process and losing the data it has loaded into cache.
It is possible to change affinity within your application. This can accommodate code sections that would do better with different process affinities.
Most high performance computing systems use batch schedulers to schedule the running of applications. We’ll give you a brief idea why in the first section of this chapter. Because schedulers are ubiquitous on high-end systems, you should have at least a basic understanding of them to be able to run jobs at high performance computing centers and even smaller clusters. We’ll cover the purpose and usage of batch schedulers. We won’t go into how to set up and manage them (that’s a whole other beast); setup and management are topics for system administrators, and we are just lowly system users.
What if you don’t have access to a system with a batch scheduler? We don’t recommend installing a batch scheduler just to try out these examples. Rather, count your blessings and keep the information in this chapter handy for when the need arises. If your demand for computational resources grows and you begin using a larger multi-user cluster, you can come back to this chapter.
There are many different batch schedulers, and each installation has its own unique customizations. We’ll discuss two batch schedulers that are freely available: the Portable Batch System (PBS) and the Simple Linux Utility for Resource Management (Slurm). There are variants of each of these, including commercially supported versions.
The PBS scheduler originated at NASA in 1991 and was released as open source under the name OpenPBS in 1998. Subsequently, commercial versions, PBS Professional by Altair and PBS/TORQUE by Adaptive Computing Enterprises, were forked off as separate versions. Freely available versions are still available and in common use on smaller clusters. Larger high performance computing sites tend to have similar versions but with a support contract.
The Slurm scheduler originated at Lawrence Livermore National Laboratory in 2002 as a simple resource manager for Linux clusters. It later was spun off into various derivative versions such as the SchedMD version.
Schedulers can also be customized with plugins or add-ins that provide additional functionality, support for special workloads, and improved scheduling algorithms. You’ll also find a number of strictly commercial batch schedulers, but their functionality is similar to those presented here. The basic concepts of each scheduler implementation are much the same, and often, many details vary from site to site. Portability of batch scripts can still be a bit of a challenge and require some customization for each system.
You just got your latest cluster up for your group and the software is running. Soon, you’ll have a dozen of your colleagues logging in and launching jobs. Ka-Boom—you have multiple parallel jobs on compute nodes colliding with each other, slowing these down, and sometimes, causing some jobs to crash. Palpable tension is in the air and tempers are short.
As high performance computing systems grow in size and number of users, it becomes necessary to add some management to the system to bring order to chaos and get the most performance from the hardware. Installation of a batch scheduler can save the day (figure 15.1). User jobs can be run, and the exclusive use of the hardware as a resource becomes a reality. However, the use of a batch system is not a panacea. While this type of software offers much to the users of the cluster or high performance computing system, batch schedulers require significant system administration time and the establishment of different queues and policies. With good policies, you can obtain privately allocated compute nodes for your exclusive use for a fixed block of time.
Figure 15.1 Batch systems are like the supermarket checkout queueing system for a computer cluster. These help to make better use of the resources and bring more efficiency to your jobs.
The order provided by the system management software is absolutely essential for achieving performance on your parallel applications. The historical work on batch schedulers in Beowulf clusters (mentioned in section 15.6.1) gives a good perspective on the importance of schedulers. In the late 1990s, Beowulf clusters emerged as a widespread movement to build computing clusters out of commodity computers. The Beowulf community soon realized that it was not enough to have a collection of computing hardware; it was necessary to have some software control and management to make it a productive resource.
Busy clusters have lots of users and lots of work. A batch system is often implemented to manage the workload and get the most out of the system. These clusters are a different environment than a standalone, single-user workstation. When working on these busy clusters, it is essential to know how to effectively use the system while being considerate of other users. We’ll give you some of the stated and unstated social rules so as to not become a pariah on the busy cluster. But first, let’s consider how these typical systems are set up.
Most clusters have some nodes set aside to be front ends. These front-end nodes are also called login nodes because that is where you will be when you log in to the system. The rest of the system is then set up as back-end nodes that are controlled and allocated by the batch system. These back-end nodes are organized into one or more queues. Each queue has a set of policies for things like the size of jobs (such as the number of processors or memory) and how long these jobs can run.
Check the load on your front end with the top command and move to a lightly loaded front-end node. There is usually more than one front-end node with numbers such as fe01, fe02, and fe03.
Watch for heavy file-transfer jobs on the front end. Some sites have special queues for these types of jobs. If you get a node that has heavy file usage, you may find that your compiles or other jobs might take much longer than usual even if the load does not appear to be high.
Some sites want you to compile on the back end and others on the front end. Check the policies on your cluster.
Don’t tie up nodes with batch interactive sessions and then go off to attend a meeting for several hours.
Rather than get a second batch interactive session, export an X terminal or shell from your first session.
For light work, look for queues for shared usage that allow over-subscription.
Many sites have special queues for debugging. Use these when you need to debug, but don’t abuse these debug queues.
Big parallel jobs should be run on the back-end nodes through the batch system queues.
Keep the number of jobs in the queue small: don’t monopolize the queues.
Try to run your big jobs during non-work hours so other users can get interactive nodes for their work.
Store large files in the appropriate place. Most sites have large parallel file systems, scratch, project, or work directories for output from calculations.
Know the purging policies for file storage. Large sites will purge files in some of the scratch directories on a periodic basis.
Clean up your files regularly and keep file systems below 90% full. File system performance drops off as file systems become full.
Note Don’t be afraid to send private messages to users who are causing problems, but be courteous. They may not realize that their work is bringing many workflows to a standstill.
Further cluster wisdom includes the following:
Heavy usage of the front-end nodes can cause instabilities and crashes. These instabilities affect the whole system as jobs can no longer be scheduled for the back-end nodes.
Often projects get resource allocations that are used for prioritizing jobs using a “fair-share” scheduling algorithm. In these cases, you may need to submit an application for the resources that you need for your project.
Each site can set policies that implement rules, but these cannot cover every situation. You should follow the spirit of the rules as well as the actual implementation. In other words, don’t game the system. It is not an inanimate object but rather your fellow users. They are also trying to get work done.
Rather than gaming the system, you should optimize your code and your file storage. The savings will allow you to get more work done and will let others get their work done on the cluster as well.
Submitting several hundred jobs into a queue when only a few can run at a time is inconsiderate. We generally submit a maximum of ten or so at a time and then submit additional jobs when each of those completes. There are many ways of doing this through shell scripts or even the batch dependency techniques (discussed later in this chapter).
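One way to keep only a handful of jobs queued at a time is to submit them in dependency chains. The following sketch assumes Slurm (`sbatch --dependency=afterany:` is standard Slurm syntax), but the job script names are hypothetical, and `DRYRUN=echo` makes the script only print the commands it would run so you can preview it without a scheduler:

```shell
#!/bin/sh
# Submit NJOBS jobs as NCHAINS dependency chains: at most NCHAINS jobs sit
# in the queue at once, and each job waits for the previous one in its chain.
DRYRUN=${DRYRUN:-echo}   # set DRYRUN="" on a real cluster to actually submit
NCHAINS=2                # number of jobs allowed in the queue at a time
NJOBS=6                  # total number of jobs to run
chain=0
while [ $chain -lt $NCHAINS ]
do
   prev=""
   job=$chain
   while [ $job -lt $NJOBS ]
   do
      if [ -z "$prev" ]; then
         $DRYRUN sbatch job${job}.sh
      else
         $DRYRUN sbatch --dependency=afterany:${prev} job${job}.sh
      fi
      prev=$job            # placeholder; on a real system, capture the job
                           # ID that sbatch prints and use it here
      job=$((job + NCHAINS))
   done
   chain=$((chain + 1))
done
```

With real job IDs captured from the sbatch output, each chain holds only one running or pending job, so the queue never sees more than NCHAINS of your jobs at once.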
For jobs that require run times much longer than the maximum batch time allowed, you should implement checkpointing (section 15.4). Checkpointing catches batch termination signals or uses wall clock timings to get the most effective use of the whole batch time. A subsequent job then starts where the last one stopped.
In this section, we’ll go through the process of submitting your first batch script. Batch systems require a different way of thinking. Instead of just launching a job whenever you want, you have to think about organizing your work. Planning results in better use of resources even before your jobs get submitted. How do you use these batch systems? As figure 15.2 shows, there are two basic system modes.
Most of the commands used in one mode can also be used in the other. Let’s work through a couple of examples to see how these modes function. We’ll work with the Slurm batch scheduler in this first set of examples. We’ll start with an interactive example and modify the example into a batch file form.
The interactive command-line mode is generally used for program development, testing, or short jobs. For submitting longer production jobs, it is more common to use a batch file to submit a batch job. The batch file allows the user to run applications overnight or unattended. Batch scripts can even be written to automatically restart jobs if there is some catastrophic system event. We’ll show the translation in syntax from the command-line option to a batch script. But first we need to go over the basic structure of a batch script.
Figure 15.2 Batch systems are typically used in either an interactive mode or a batch usage model.
We show some of the more common options for Slurm in table 15.1.
Table 15.1 Slurm command options
[--output|-o]=filename    Writes standard output to the specified filename
Let’s go ahead and put this all together into our first full Slurm batch script as the following listing shows. This example is included with the associated code for the book at https://github.com/EssentialsofParallelComputing/Chapter15. As always, we encourage you to follow along with the examples for this chapter.
Listing 15.1 Slurm batch script for a parallel job
1 #!/bin/sh
2 #SBATCH -N 1                ❶
3 #SBATCH -n 4                ❷
5 #SBATCH -t 01:00:00         ❸
6
7 # Do not place bash commands before the last SBATCH directive
8 # Behavior can be unreliable
9
10 mpirun -n 4 ./testapp &> run.out
The -N on line 2 can alternatively be specified with --nodes. The -N option has a different meaning in other batch schedulers and MPI implementations, which can lead to incorrect values and errors. You should be on the lookout for inconsistencies in syntax across the batch systems and MPI implementations that you use. We then submit this job with sbatch < first_slurm_batch_job. We'll get the equivalent of the batch job in an interactive job with
frontend> salloc -N 1 -n 4 -t 01:00:00
computenode22> mpirun -n 4 ./testapp
computenode22> exit
Note The options are the same in both the batch file and on the command line.
We need to make a special mention of the exclusive and oversubscribe options. One of the major reasons for using a batch system is to get exclusive use of the resource for more efficient application performance. Nearly every major computing center sets the default behavior to exclusive use of the resource. But the configuration may set one partition to be shared for particular use cases. You can use these command options, exclusive and oversubscribe, for the sbatch and srun commands to request a different behavior than the system configuration. However, you cannot override the shared configuration setting for a partition.
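As a sketch of how these options appear in a batch script (the node counts, time, and the commented srun line are illustrative):

```shell
#!/bin/sh
#SBATCH --exclusive     # request whole nodes even if the partition default allows sharing
#SBATCH -N 2
#SBATCH -n 8
#SBATCH -t 00:30:00

mpirun -n 8 ./testapp &> run.out

# On a partition configured to allow sharing, tasks can instead be oversubscribed:
#   srun --oversubscribe -n 16 ./testapp
```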
Most large computing systems are composed of many nodes with identical characteristics. It is, however, increasingly common to have systems with a variety of node types. Slurm provides commands that can request nodes with special characteristics. For example, you can use --mem=<#> to get large memory nodes with the requested size in MB. There are many other special requests that can be made through the batch system. A batch script for the PBS batch scheduler is similar, but with a different syntax. Some of the most common PBS options are shown in table 15.2.
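A Slurm request for a large-memory node can be sketched as the following batch header (the 256,000 MB value is only illustrative):

```shell
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --mem=256000    # request a node with at least 256,000 MB of memory
#SBATCH -t 01:00:00

mpirun -n 4 ./testapp &> run.out
```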
Table 15.2 PBS command options
The -l option is a catch-all that is used for a variety of options. Let’s put together the equivalent PBS batch script for the same job as in listing 15.1. The following listing shows the PBS script.
Listing 15.2 PBS batch script for a parallel job
1 #!/bin/sh
2 #PBS -l nodes=1             ❶
3 #PBS -l procs=4             ❶
5 #PBS -l walltime=01:00:00   ❶
6
7 # Do not place bash commands before the last PBS directive
8 # Behavior can be unreliable
9
10 mpirun -n 4 ./testapp &> run.out
For PBS, we submit the job with qsub < first_pbs_batch_job. To get an interactive allocation in PBS, we use the -I option to qsub:
frontend> qsub -I -l nodes=1,procs=4,walltime=01:00:00
computenode22> mpirun -n 4 ./testapp &> run.out
computenode22> exit
You may need to specify a queue or other site-specific information for these examples. Many sites have different queues for long, short, large, and other specialized situations. Consult the local site documentation for these important details.
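A queue request can be sketched as follows; the partition name "short" is illustrative and site-specific:

```shell
#!/bin/sh
#SBATCH -p short       # Slurm partition name; check your site's queue names
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -t 00:10:00

mpirun -n 4 ./testapp &> run.out
```

With PBS, the equivalent is a `#PBS -q short` directive in the script or `qsub -q short` on the command line.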
We’ve seen a couple of batch scheduler commands in the previous discussion. To effectively use the system, you will need more commands, found below. These batch scheduler commands check on the status of your job, get information on the system resources, and cancel jobs. We summarize the most common commands for both the Slurm and PBS schedulers next.
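A few of the everyday commands can be sketched as below, in their Slurm form with a rough PBS equivalent in each comment. The job ID 12345 is illustrative, and the block is guarded with command -v so it is a no-op on a machine without a scheduler installed.

```shell
#!/bin/sh
# Everyday queue-management commands (Slurm, with PBS equivalents in comments)
HAVE_SLURM=no
command -v squeue >/dev/null 2>&1 && HAVE_SLURM=yes

if [ "$HAVE_SLURM" = yes ]; then
    squeue -u "$USER"    # status of my jobs          (PBS: qstat -u $USER)
    sinfo                # node and partition state   (PBS: pbsnodes -a)
    scancel 12345        # cancel job 12345           (PBS: qdel 12345)
fi
```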
Most high-performance computing sites limit the maximum time that a job can run. So how do you run longer jobs? The typical approach is for applications to periodically write out their state into files and then a follow-on job is submitted that reads the file and starts at that point in the run. This process, as illustrated in figure 15.3, is referred to as checkpointing and restarting.
Figure 15.3 A checkpoint file that saves the state of the calculation is written out to disk at the conclusion of a batch job and then the next batch job reads the file and restarts the calculation where the previous job left off.
The checkpointing process is useful for dealing with a limited time for a batch job and for handling system crashes or other job interruptions. You might restart your jobs manually for a small number of cases, but as the number of restarts gets larger, it becomes a real burden. If this is the case, you should add the capability to automate the process. It takes a fair amount of effort to do this and requires changes to your application and more sophisticated batch scripts. We show a skeleton application where we have done this.
First, the batch script needs to signal your application that it is reaching the end of its allocated time. Then the script needs to resubmit itself recursively until your job reaches completion. The following listing shows such a script for Slurm.
Listing 15.3 Batch script to automatically restart
AutomaticRestarts/batch_restart.sh
1 #!/bin/sh
< ... usage notes ... >
13 #SBATCH -N 1
14 #SBATCH -n 4
15 #SBATCH --signal=23@160 ❶
16 #SBATCH -t 00:08:00
17
18 # Do not place bash commands before the last SBATCH directive
19 # Behavior can be unreliable
20
21 NUM_CPUS=${SLURM_NTASKS}
22 OUTPUT_FILE=run.out
23 EXEC_NAME=./testapp
24 MAX_RESTARTS=4 ❷
25
26 if [ -z ${COUNT} ]; then ❸
27 export COUNT=0 ❸
28 fi
29
30 ((COUNT++)) ❸
31 echo "Restart COUNT is ${COUNT}" ❸
32
33 if [ ! -e DONE ]; then ❹
34 if [ -e RESTART ]; then ❺
35 echo "=== Restarting ${EXEC_NAME} ===" \
>> ${OUTPUT_FILE}
36 cycle=`cat RESTART` ❻
37 rm -f RESTART
38 else
39 echo "=== Starting problem ===" \
>> ${OUTPUT_FILE}
40 cycle=""
41 fi
42
43 mpirun -n ${NUM_CPUS} ${EXEC_NAME} \ ❼
${cycle} &>> ${OUTPUT_FILE} ❼
44 STATUS=$?
45 echo "Finished mpirun" \
>> ${OUTPUT_FILE}
46
47 if [ ${COUNT} -ge ${MAX_RESTARTS} ]; then ❽
48 echo "=== Reached maximum number of restarts ===" \
>> ${OUTPUT_FILE}
49 date > DONE
50 fi
51
52 if [ ${STATUS} = "0" -a ! -e DONE ]; then
53 echo "=== Submitting restart script ===" \
>> ${OUTPUT_FILE}
54 sbatch <batch_restart.sh ❾
55 fi
56 fi
❶ Sends application a signal 23 (SIGURG) 160 s before termination
❷ Maximum number of script submissions
❸ Counts the number of submissions
❹ Skips the run if a DONE file exists
❺ Checks for a RESTART file from a previous job
❻ Gets the iteration number for the command line
❼ Invokes MPI job with command-line arguments
❽ Exits if reached maximum restarts
❾ Resubmits this script for the next segment of the run
This script has a lot of moving parts. Much of this is to avoid a runaway situation where more batch jobs are submitted than needed. The script also requires cooperation with the application. This cooperation includes these tasks:
The batch system sends a signal and the application catches it.
The application writes out to a file named DONE when complete.
The application writes out the iteration number to a file named RESTART.
The application writes out a checkpoint file and reads it on restart.
The signal number might need to vary depending on what the batch system and MPI already use. We also caution you not to put shell commands before any of the SBATCH directives. While the script might seem to work, we found that the signals did not function properly; therefore, order does matter, and you won't always get an obvious failure. Listing 15.4 shows a skeleton of an application code in C to demonstrate the automatic restart functionality.
Note The example codes at https://github.com/EssentialsofParallelComputing/Chapter15 also contain a Fortran example of an automatic restart.
Listing 15.4 Sample application for testing
AutomaticRestarts/testapp.c
1 #include <unistd.h>
2 #include <time.h>
3 #include <stdio.h>
4 #include <stdlib.h>
5 #include <signal.h>
6 #include <mpi.h>
7
8 static int batch_terminate_signal = 0;        ❶
9 void batch_timeout(int signum){               ❷
10    printf("Batch Timeout : %d\n",signum);
11    batch_terminate_signal = 1;               ❷
12    return;
13 }
14
15 int main(int argc, char *argv[])
16 {
17    MPI_Init(&argc, &argv);
18    char checkpoint_name[50];
19    int mype, itstart = 1;
20    MPI_Comm_rank(MPI_COMM_WORLD, &mype);
21
22    if (argc >=2) itstart = atoi(argv[1]);
      // < ... read restart file ... >          ❸
24
25    if (mype ==0) signal(23, batch_timeout);  ❹
26
27    for (int it=itstart; it < 10000; it++){
28       sleep(1);                              ❺
29
30       if ( it%60 == 0 ) {
            // < ... write out checkpoint file ... >   ❻
40       }
41       int terminate_sig = batch_terminate_signal;
42       MPI_Bcast(&terminate_sig, 1, MPI_INT, 0, MPI_COMM_WORLD);
43       if ( terminate_sig ) {
            // < ... write out RESTART and             ❼
            //       special checkpoint file ... >     ❼
54          MPI_Finalize();
55          exit(0);
56       }
57
58    }
59    // < ... write out DONE file ... >        ❽
67    MPI_Finalize();
68    return(0);
69 }
❶ Global variable for batch signal
❷ Callback function sets the global variable
❸ If a restart, reads the checkpoint file
❹ Sets the callback function for signal 23
❺ Stands in for computational work
❻ Writes out checkpoint every 60 iterations
❼ Writes out special checkpoint file and a file named RESTART
❽ Writes out DONE file when application meets completion criteria
This may appear to be a short and simple code, but there is a lot packed into these lines. A real application would need hundreds of lines to fully implement checkpointing and restart, completion criteria, and input handling. We also caution that developers need to carefully check their code to prevent runaway conditions. The signal timing also needs to be tuned for how long it takes to catch the signal, complete the iterations, and write out the restart file. For our little skeleton for an automatic restart application, we start the submission with
sbatch < batch_restart.sh
=== Starting problem ===
App launch reported: 2 (out of 2) daemons - 0 (out of 4) procs
60 Checkpoint: Mon May 11 20:06:08 2020
120 Checkpoint: Mon May 11 20:07:08 2020
180 Checkpoint: Mon May 11 20:08:08 2020
240 Checkpoint: Mon May 11 20:09:08 2020
Batch Timeout : 23
297 RESTART: Mon May 11 20:10:05 2020
Finished mpirun
=== Submitting restart script ===
=== Restarting ./testapp ===
App launch reported: 2 (out of 2) daemons - 0 (out of 4) procs
300 Checkpoint: Mon May 11 20:10:11 2020
< ... skipping output ... >
1186 RESTART: Mon May 11 20:25:05 2020
Finished mpirun
=== Reached maximum number of restarts ===
From the output, we see that the application writes out periodic checkpoint files every 60 iterations. Because the stand-in for computation work is actually a sleep command of 1 s, the checkpoints are 1 min apart. After approximately 300 s, the batch system sends the signal and the test application reports that it was caught. At that point, the script writes out a file named RESTART that contains the iteration number. The script then writes out a message that the restart script was resubmitted. The output also shows the application starting back up. In the output, we skipped showing the additional restarts and just showed the message that the maximum number of restarts has been reached.
Do batch systems have built-in support for sequences of batch jobs? Most have a dependency feature that allows you to specify how one job depends on another. Using this dependency capability, we can get our subsequent jobs submitted earlier in the queue by submitting the next batch job prior to running our application. As figure 15.4 shows, this may give us higher priority for starting up the next batch job, depending on the policies of the site. Regardless, your jobs will be in the queue, and you don’t have to worry about whether the next job will be submitted.
Figure 15.4 An automatic restart submitted at the start of a batch job will have more time in the queue, which can give your restart job higher priority than one submitted at the end of the batch job (depending on local scheduling policies).
We can make this change to the batch script by adding the dependency clause (on line 44 in the following listing). This batch script is submitted first, before we begin our work, but with a dependency on the completion of this current batch job.
Listing 15.5 Batch script that submits the restart script first
Prestart/batch_restart.sh
1 #!/bin/sh
< ... usage notes ... >
13 #SBATCH -N 1
14 #SBATCH -n 4
15 #SBATCH --signal=23@160
16 #SBATCH -t 00:08:00
17
18 # Do not place bash commands before the last SBATCH directive
19 # Behavior can be unreliable
20
21 NUM_CPUS=4
22 OUTPUT_FILE=run.out
23 EXEC_NAME=./testapp
24 MAX_RESTARTS=4
25
26 if [ -z ${COUNT} ]; then
27 export COUNT=0
28 fi
29
30 ((COUNT++))
31 echo "Restart COUNT is ${COUNT}"
32
33 if [ ! -e DONE ]; then
34 if [ -e RESTART ]; then
35 echo "=== Restarting ${EXEC_NAME} ===" \
>> ${OUTPUT_FILE}
36 cycle=`cat RESTART`
37 rm -f RESTART
38 else
39 echo "=== Starting problem ===" \
>> ${OUTPUT_FILE}
40 cycle=""
41 fi
42
43 echo "=== Submitting restart script ===" \
>> ${OUTPUT_FILE}
44 sbatch --dependency=afterok:${SLURM_JOB_ID} \
<batch_restart.sh ❶
45
46 mpirun -n ${NUM_CPUS} ${EXEC_NAME} ${cycle} \
&>> ${OUTPUT_FILE}
47 echo "Finished mpirun" \
>> ${OUTPUT_FILE}
48
49 if [ ${COUNT} -ge ${MAX_RESTARTS} ]; then
50 echo "=== Reached maximum number of restarts ===" \
>> ${OUTPUT_FILE}
51 date > DONE
52 fi
53 fi
❶ Submit this batch job first with a dependency on its completion
This listing showed how to use dependencies in your batch scripts for the simple case of a checkpoint/restart, but dependencies are useful for many other situations. More complicated workflows might have pre-processing steps that need to complete before the main work and then a post-processing step afterward. Some more complex workflows need more than a dependency on whether the previous job completed. Fortunately, batch systems provide other types of dependencies between jobs. Table 15.3 shows the various possible options. PBS has similar dependencies for batch jobs that can be specified with -W depend=<type:job id>.
Table 15.3 Dependency options for batch jobs
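A few of the commonly used Slurm dependency types can be sketched as follows. The job ID 1001 and the script names are illustrative; the `submit` function echoes the command instead of calling sbatch so the sketch runs off-cluster (on a real system, replace `echo sbatch` with `sbatch`).

```shell
#!/bin/sh
# Common Slurm dependency types, shown via a stubbed submit function
submit() { echo sbatch "$@"; }

submit --dependency=afterok:1001 post.sh        # start only if job 1001 succeeded
submit --dependency=afterany:1001 cleanup.sh    # start once job 1001 ends, pass or fail
submit --dependency=afternotok:1001 recover.sh  # start only if job 1001 failed
submit --dependency=singleton collect.sh        # wait for earlier jobs with the same name and user

# PBS spelling of the first form:
#   qsub -W depend=afterok:1001 post.sh
```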
There are general reference materials for the Slurm and PBS schedulers, but you should also look at the documentation for your site. Many sites have customized setups and added commands and features for their specific needs. If you think you might want to set up a computing cluster with a batch system, you may want to research new initiatives such as OpenHPC and the Rocks Cluster distributions that have recently been released for different HPC computing niches.
Both freely available and commercially supported versions of Slurm are available from SchedMD. Not surprisingly, the SchedMD site has a lot of documentation on Slurm. Another good reference site is Lawrence Livermore National Laboratory where Slurm was originally developed.
SchedMD and Slurm documentation at https://slurm.schedmd.com.
Blaise Barney, “Slurm and Moab,” Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/moab/.
The best information on PBS is the PBS User Guide:
Altair Engineering, PBS User Guide, https://www.altair.com/pdfs/pbsworks/PBSUserGuide2021.1.pdf.
Though somewhat dated, the following online reference to setting up a Beowulf cluster is a good historical perspective on the emergence of cluster computing and how to set up cluster management, including the PBS batch scheduler:
Edited by William Gropp, Ewing Lusk, and Thomas Sterling, Beowulf Cluster Computing with Linux, 2nd ed. (MIT Press, 2003), http://etutorials.org/Linux+systems/cluster+computing+with+linux/.
Here are some sites with information on current HPC software management systems:
OpenHPC, http://www.openhpc.community.
Rocks Cluster, http://www.rocksclusters.org.
Try submitting a couple of jobs, one with 32 processors and one with 16 processors. Check to see that these are submitted and whether they are running. Delete the 32-processor job. Check to see that it got deleted.
Modify the automatic restart script so that the first job is a preprocessing step to set up for the computation before the restarts run the simulation.
Modify the simple batch script in listing 15.1 for Slurm and 15.2 for PBS to clean up on failure by removing a file called simulation_database.
Batch schedulers allocate resources so that you can use a parallel cluster efficiently. It is important to learn how to use these to run on larger, high-performance computing systems.
There are many commands to query your job and its status. Knowing these commands allows you to better utilize the system.
You can use automatic restarts and chaining of jobs to run larger simulations and workflows. Adding this capability to your application makes it possible to scale to problems that you would not otherwise be able to do.
Batch job dependencies give the capability of controlling complex workflows. By using dependencies between multiple jobs, you can stage data, preprocess it for a calculation, or launch a post-processing job.
Filesystems create a streamlined workflow of retrieving, storing, and updating data. For any computing work, the product is the output, whether it be data, graphics, or statistics. This includes final results but also intermediate output for graphics, checkpointing, and analysis. Checkpointing is a special need on large HPC systems with long-running calculations that might span days, weeks, or months.
Definition Checkpointing is the practice of periodically storing the state of a calculation to disk so that the calculation can be restarted in the event of system failures or because of finite-length run times in a batch system.
When processing data for highly parallel applications, there needs to be a safe and performant way of reading and storing data at run time. Therein lies the need to understand file operations in a parallel world. Some of the concerns you should keep in mind are correctness, reducing duplicate output, and performance.
It is important to be aware that the scaling of filesystem performance has not kept up with the rest of the computing hardware. We are scaling calculations up to billions of cells or particles, which puts severe demands on the filesystems. With the advent of machine learning and data science, many more applications need big data, requiring large file sets and complex workflows with intermediate file storage.
Adding an understanding of file operations to your HPC toolset is becoming more and more important. In this chapter, we introduce how to modify file operations for a parallel application so that you are writing out data efficiently and making the best use of the available hardware. Though this topic may not be heavily covered in many parallel tutorials, we think it’s a baseline essential for today’s parallel applications. You will learn how to speed up the file-writing operation by orders of magnitude while maintaining correctness. We will also look at the different software and hardware that are typically used for large HPC systems. We will use the example of writing out the data from the domain decomposition of a regular grid with halo cells using different parallel file software. We encourage you to follow along with the examples for this chapter at https://github.com/EssentialsOfParallelComputing/Chapter16.git.
We first review what hardware comprises a high-performance filesystem. Traditionally, file operations store data to a hard disk with a mechanical mechanism that writes a series of bits to a magnetic substrate. Like many other parts of HPC systems, the storage hardware has become more complex with deeper hierarchies of hardware and different performance characteristics. This evolution of storage hardware is similar to the deepening of the cache hierarchy for processors as these increased in performance. The storage hierarchy also helps to cover the large disparity in bandwidth at the processor level, compared to mechanical disk storage. This is because it is much harder to reduce the size of mechanical components than electrical circuits. The introduction of solid-state drives (SSDs) and other solid-state devices has helped to provide a way around the scaling of physical spinning disks.
Let’s first specify what might comprise an HPC storage system as illustrated in figure 16.1. Typical storage hardware components include the following:
Spinning disk—Electro-mechanical device where data is stored in an electro-magnetic layer through the movement of a mechanical recording head.
SSD—A solid-state drive (SSD) is a solid-state memory device that can replace a mechanical disk.
Burst buffer—Intermediate storage hardware layer composed of NVRAM and SSD components. It is positioned between the compute hardware and the main disk storage resources.
The storage schematic in figure 16.1 illustrates the storage hierarchy between the compute system and the storage system. Burst buffers are inserted in between the compute hardware and the main disk storage to cover the increasing gap in performance. Burst buffers can either be placed on each node or on the IO nodes and shared via a network with the other compute nodes.
Figure 16.1 Schematic showing positioning of burst buffer hardware in between the compute resources and disk storage. Burst buffers can either be node-local or shared among nodes via a network.
With the rapid development of solid-state storage technology, burst buffer designs will continue to evolve in the near future. Besides helping with the gap in latency and bandwidth performance, new storage designs are increasingly driven by the need to reduce power requirements as systems grow in size. Magnetic tape has traditionally been used for long-term storage, but some designs have even looked at "dark disks," where spinning disks are used but turned off when not needed.
Let’s first take a look at standard file operations. For our parallel applications, the conventional file-handling interface is still a serial operation. It is not practical to have a hard disk for every processor. Even a file per process is only viable in limited situations and at small scale. The result is that for every file operation, we go from parallel to serial. A file operation needs to be treated as a reduction (or expansion for reads) in the number of processes, requiring special handling for parallel applications. You can handle this parallelism with some simple modifications to standard file input and output (IO).
A large portion of the modifications for parallel applications is at the file-operation interface. We should first review our prior examples that involved file operations. For an example of file input, section 8.3.2 shows how to read in data on one process and then broadcast it to other processes. In section 8.3.4, we used an MPI gather operation so that output from processes is written out in a deterministic order.
(Pro tip) To avoid later complications, the first step you should take in parallelizing an application is to go through the code and insert an if (rank == 0) in front of every input and output statement. While going through the code, you should identify which file operations need additional treatment. These operations include the following (illustrated in figure 16.2).
Opening files on only one process and then broadcasting the data to other processes
Distributing data that needs to be partitioned across processes with a scatter operation
Collecting the distributed data with a gather operation before it is output
Figure 16.2 Modifications for a parallel application to work with a standard filesystem. All file operations are done from rank 0.
A common inefficiency is to open a file on every process; you can imagine it being equivalent to a dozen people trying to open a door at the same time. While your program might not crash, it causes problems at scale (imagine 1,000 people opening that same door). There is a lot of contention for the file metadata and for the locks required for correctness, which can take minutes at larger process counts. We can avoid this contention by opening the file on just one process. By adding parallel communication calls at each of the transition points from serial to parallel and parallel to serial, we can make modest parallel applications work using standard files. This is sufficient for the vast majority of parallel applications.
As our applications grow in size, we can no longer easily gather or scatter the data to a single process. Our biggest limitation is memory; we don’t have enough memory resources on a single process to bring the data from thousands of other processes down to just one. Thus, we have to have a different, more scalable approach to file operations. That is the subject of the next two sections on MPI file operations, called MPI-IO, and Hierarchical Data Format v5 (HDF5). In these sections, we show how these two libraries permit a parallel application to treat file operations in a parallel manner. There are other parallel file libraries that we will mention in section 16.5.
The best way to learn MPI-IO is to see how it is used in a realistic scenario. We’ll take a look at the example of writing out a regular mesh that has been distributed across processors with halo cells using MPI-IO. Through this example, you will become familiar with the basic structure that occurs with MPI-IO and some of its more common function calls.
The first parallel file operations were added to MPI in the MPI-2 standard in the late 1990s. The first widely available implementation of the MPI file operations, ROMIO, was led by Rajeev Thakur at Argonne National Laboratory (ANL). ROMIO can be used with any MPI implementation. Most MPI distributions include ROMIO as a standard part of their software release. MPI-IO has a lot of functions, all beginning with the prefix MPI_File. In this section, we will cover just a subset of the most commonly used operations (see table 16.1).
There are different ways to use MPI-IO. We are interested in the highly parallel version, the collective form that has the processes work together to write to their section of the file. In order to do this, we’ll utilize the ability to create a new MPI data type that was first introduced in section 8.5.1.
The MPI-IO library has both a shared file pointer across all processes and independent file pointers for each process. Using the shared pointer causes a lock to be applied for each process and serializes the file operations. To avoid the locks, we use the independent file pointers for better performance.
File operations are broken down into collective and non-collective operations. Collective operations use the MPI collective communication calls, and all members of the communicator must make the call or it will hang. Non-collective calls are serial operations that are invoked separately for every process. Table 16.1 shows some general-purpose operations and the respective commands for each.
Table 16.1 MPI general file routines
| MPI_File_open | Collective file open |
| MPI_File_close | Collective file close |
| MPI_File_seek | Moves the individual file pointer to the specified location |
| MPI_File_set_info | Communicates hints to the MPI-IO library for more optimized MPI operations |
| MPI_File_set_size | Sets the size of the file, preallocating the space |
| MPI_File_delete | Deletes the file (non-collective; invoked by every process) |
The file open and close operations are self-explanatory. The seek operation moves the individual file pointer to the specified location for each process. You can use MPI_File_set_info to communicate both general and vendor-specific hints. There is also an MPI_File_delete, but it is a non-collective call. Here, by non-collective we mean a serial call: every process deletes the file. For C and C++ programs, the remove function works just as well. Calling MPI_File_set_size with the expected size of your file can be more efficient than having the file grow incrementally with each write.
We’ll start by looking at the independent file operations for the read and write operations. When each process operates on its independent file pointer, it’s known as an independent file operation. Independent file operations are useful for writing out replicated data across processes. For this common data, you can write it out from a single rank with the routines in table 16.2.
Table 16.2 MPI independent file routines
| MPI_File_read_at | Moves the file pointer to the specified location and reads the data |
| MPI_File_write_at | Moves the file pointer to the specified location and writes the data |
You should write out distributed data with collective operations (table 16.3). When processes operate collectively on the file, it’s known as a collective file operation. The write and read functions are similar to the independent file operations but with an _all appended to the function name. To make the best use of the collective operations, we need to create complex MPI data types. The MPI_File_set_view function is used to set the data layout in the file.
Table 16.3 MPI collective file routines
| MPI_File_set_view | Sets the data layout in the file |
| MPI_File_write_all | Collective write of the distributed data |
| MPI_File_read_all | Collective read of the distributed data |
For this example, we'll break up the code into four blocks. (The full code for this example is included with the code for the chapter.) We must first create an MPI data type for the memory layout of the data and another for the file layout; these are referred to as memspace and filespace, respectively. Figure 16.3 shows these data types for a smaller 4×4 version of our example. For simplicity, we show only four processes, each with a 4×4 grid surrounded by a one-cell halo. The halo depth in the figure is ng, short for number of ghost cells.
Figure 16.3 The 4x4 blocks of data from each process written without the halo cells to contiguous sections of the output file. The top row is the memory layout on the process, referred to as the memspace. The middle row is the memory in the file with the halo cells stripped off, referred to as the filespace. The memory in the file is actually linear, so it takes the form in the last row.
The first block of code in listing 16.1 shows the creation of these two data types. This only needs to be done once at the start of the program. The data types should then be freed at the end of the program in the finalize routine.
Listing 16.1 Setting up MPI-IO dataspace types
MPI_IO_Examples/mpi_io_block2d/mpi_io_file_ops.c
10 void mpi_io_file_init(int ng, int ndims, int *global_sizes,
11 int *global_subsizes, int *global_starts, MPI_Datatype *memspace,
MPI_Datatype *filespace){
12 // create data descriptors on disk and in memory
13
14 // Global view of entire 2D domain -- collates decomposed subarrays
15 MPI_Type_create_subarray(ndims, ❶
global_sizes, global_subsizes, ❶
16 global_starts, MPI_ORDER_C, MPI_DOUBLE, ❶
filespace); ❶
17 MPI_Type_commit(filespace); ❷
18
19 // Local 2D subarray structure -- strips ghost cells on node
20 int ny = global_subsizes[0], nx = global_subsizes[1];
21 int local_sizes[] = {ny+2*ng, nx+2*ng};
22 int local_subsizes[] = {ny, nx};
23 int local_starts[] = {ng, ng};
24
25    MPI_Type_create_subarray(ndims, local_sizes,                 ❸
local_subsizes, local_starts, ❸
26 MPI_ORDER_C, MPI_DOUBLE, memspace); ❸
27 MPI_Type_commit(memspace); ❹
28 }
29
30 void mpi_io_file_finalize(MPI_Datatype *memspace,
MPI_Datatype *filespace){
31 MPI_Type_free(memspace); ❺
32 MPI_Type_free(filespace); ❺
33 }
❶ Creates the data type for the file data layout
❷ Commits the file data type
❸ Creates the data type for the memory data layout
❹ Commits the memory data type
❺ Frees the data types
In this first step, we created the two data types from figure 16.3. Now we need to write these data types out to the file. The writing process has four steps, marked by the numbered callouts in listing 16.2: create the file, set the file view, write the data with a collective call, and close the file.
Listing 16.2 Writing an MPI-IO file
MPI_IO_Examples/mpi_io_block2d/mpi_io_file_ops.c
35 void write_mpi_io_file(const char *filename, double **data,
36 int data_size, MPI_Datatype memspace, MPI_Datatype filespace,
MPI_Comm mpi_io_comm){
37 MPI_File file_handle = create_mpi_io_file( ❶
38 filename, mpi_io_comm, (long long)data_size); ❶
39
40 MPI_File_set_view(file_handle, file_offset, ❷
41 MPI_DOUBLE, filespace, "native", ❷
MPI_INFO_NULL); ❷
42 MPI_File_write_all(file_handle, ❸
&(data[0][0]), 1, memspace, ❸
MPI_STATUS_IGNORE); ❸
43 file_offset += data_size;
44
45 MPI_File_close(&file_handle); ❹
46 file_offset = 0;
47 }
48
49 MPI_File create_mpi_io_file(const char *filename, MPI_Comm mpi_io_comm,
50 long long file_size){
51 int file_mode = MPI_MODE_WRONLY | MPI_MODE_CREATE |
MPI_MODE_UNIQUE_OPEN;
52
53 MPI_Info mpi_info = MPI_INFO_NULL; // For MPI IO hints
54 MPI_Info_create(&mpi_info);
55 MPI_Info_set(mpi_info, ❺
"collective_buffering", "1"); ❺
56 MPI_Info_set(mpi_info, ❻
"striping_factor", "8"); ❻
57 MPI_Info_set(mpi_info, ❻
"striping_unit", "4194304"); ❻
58
59 MPI_File file_handle = NULL;
60 MPI_File_open(mpi_io_comm, filename, file_mode, mpi_info,
&file_handle);
61 if (file_size > 0) ❼
MPI_File_set_size(file_handle, file_size); ❼
62 file_offset = 0;
63 return file_handle;
64 }
❶ Creates and opens the file
❷ Sets the file view for the data layout
❸ Writes the data with a collective call
❹ Closes the file
❺ Communicates hints for collective operation
❻ Communicates hints for striping on Lustre filesystem
❼ Preallocates file space for better performance
There are a few optimizations that can be provided during the open with hints in an MPI_Info object (line 53). A hint could be that the file operations should be done using collective operations, collective_buffering, as on line 55. Or a hint can be one that’s filesystem specific to stripe across eight hard disks, striping_factor = 8, as on line 56. We will discuss hints more in section 16.6.1.
We can also preallocate the file space, as shown on line 61, so that it doesn’t have to be increased during the writes. Reading the file has the same four steps as the writing process listed previously and is shown in the following listing.
Listing 16.3 Reading an MPI-IO file
MPI_IO_Examples/mpi_io_block2d/mpi_io_file_ops.c
66 void read_mpi_io_file(const char *filename, double **data, int data_size,
67 MPI_Datatype memspace, MPI_Datatype filespace, MPI_Comm mpi_io_comm){
68 MPI_File file_handle = open_mpi_io_file( ❶
filename, mpi_io_comm); ❶
69
70 MPI_File_set_view(file_handle, file_offset, ❷
71 MPI_DOUBLE, filespace, "native", ❷
MPI_INFO_NULL); ❷
72 MPI_File_read_all(file_handle, ❸
&(data[0][0]), 1, memspace, ❸
MPI_STATUS_IGNORE); ❸
73 file_offset += data_size;
74
75 MPI_File_close(&file_handle); ❹
76 file_offset = 0;
77 }
78
79 MPI_File open_mpi_io_file(const char *filename, MPI_Comm mpi_io_comm){
80 int file_mode = MPI_MODE_RDONLY | MPI_MODE_UNIQUE_OPEN;
81
82 MPI_Info mpi_info = MPI_INFO_NULL; // For MPI IO hints
83 MPI_Info_create(&mpi_info);
84 MPI_Info_set(mpi_info, "collective_buffering", "1");
85
86 MPI_File file_handle = NULL;
87 MPI_File_open(mpi_io_comm, filename, file_mode, mpi_info,
&file_handle);
88 return file_handle;
89 }
The read operation requires fewer hints and settings than the write operation. This is because some of the settings for a read are determined from the file. So far, these MPI-IO file operations have been written in a general form that can be called for any problem. Now let’s take a look at the main application code in the following listing that sets up the calls.
Listing 16.4 Main application code
MPI_IO_Examples/mpi_io_block2d/mpi_io_block2d.c
9 int main(int argc, char *argv[])
10 {
11 MPI_Init(&argc, &argv);
12
13 int rank, nprocs;
14 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
15 MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
16
17 // for multiple files, subdivide communicator and
// set colors for each set
18 MPI_Comm mpi_io_comm = MPI_COMM_NULL;
19 int nfiles = 1;
20 float ranks_per_file = (float)nprocs/(float)nfiles;
21 int color = (int)((float)rank/ranks_per_file);
22 MPI_Comm_split(MPI_COMM_WORLD, color, rank, &mpi_io_comm);
23 int nprocs_color, rank_color;
24 MPI_Comm_size(mpi_io_comm, &nprocs_color);
25 MPI_Comm_rank(mpi_io_comm, &rank_color);
26 int row_color = 1, col_color = rank_color;
27 MPI_Comm mpi_row_comm, mpi_col_comm;
28 MPI_Comm_split(mpi_io_comm, row_color, rank_color, &mpi_row_comm);
29 MPI_Comm_split(mpi_io_comm, col_color, rank_color, &mpi_col_comm);
30
31 // set the dimensions of our data array and the number of ghost cells
32 int ndim = 2, ng = 2, ny = 10, nx = 10;
33 int global_subsizes[] = {ny, nx};
34
35 int ny_offset = 0, nx_offset = 0;
36 MPI_Exscan(&nx, &nx_offset, 1, MPI_INT, MPI_SUM, mpi_row_comm);
37 MPI_Exscan(&ny, &ny_offset, 1, MPI_INT, MPI_SUM, mpi_col_comm);
38 int global_offsets[] = {ny_offset, nx_offset};
39
40 int ny_global, nx_global;
41 MPI_Allreduce(&nx, &nx_global, 1, MPI_INT, MPI_SUM, mpi_row_comm);
42 MPI_Allreduce(&ny, &ny_global, 1, MPI_INT, MPI_SUM, mpi_col_comm);
43 int global_sizes[] = {ny_global, nx_global};
44 int data_size = ny_global*nx_global;
45
46 double **data = (double **)malloc2D(ny+2*ng, nx+2*ng);
47 double **data_restore = (double **)malloc2D(ny+2*ng, nx+2*ng);
< ... skipping data initialization ... >
54
55 MPI_Datatype memspace = MPI_DATATYPE_NULL,
filespace = MPI_DATATYPE_NULL;
56    mpi_io_file_init(ng, ndim, global_sizes,             ❶
global_subsizes, global_offsets, ❶
57 &memspace, &filespace); ❶
58
59 char filename[30];
60    if (nfiles > 1) {
61 sprintf(filename,"example_%02d.data",color);
62 } else {
63 sprintf(filename,"example.data");
64 }
65
66 // Do the computation and write out a sequence of files
67 write_mpi_io_file(filename, data, ❷
data_size, memspace, filespace, ❷
mpi_io_comm); ❷
68 // Read back the data for verifying the file operations
69 read_mpi_io_file(filename, data_restore, ❸
70 data_size, memspace, filespace, ❸
mpi_io_comm); ❸
71
72 mpi_io_file_finalize(&memspace, &filespace); ❹
73
< ... skipping verification code ... >
105
106 free(data);
107 free(data_restore);
108
109 MPI_Comm_free(&mpi_io_comm);
110 MPI_Comm_free(&mpi_row_comm);
111 MPI_Comm_free(&mpi_col_comm);
112 MPI_Finalize();
113 return 0;
114 }
❶ Initializes and sets up the data types
❷ Writes the data to the file
❸ Reads the data back to verify the file operations
❹ Closes the file and frees the data types
This setup takes a little explanation. This code supports the ability to write out more than one MPI data file. This is commonly called NxM file writes where N processes write out M files and where M is greater than one but much smaller than the number of processes (figure 16.4). The reason for this technique is that at larger problem sizes, writing to a single file does not always scale well.
Figure 16.4 At large sizes, the processes can be broken up into communication groups by colors so they write out to separate files. The ranks of the subgroups are in the same order as the ranks in the original communicator.
We can break up the processes into groups by colors as shown in figure 16.4. In lines 17-22 in listing 16.4, we set up a new communicator based on M colors, where M is the number of files. The number of files is set on line 19, and our color is computed on lines 20 and 21. We use a floating-point type for ranks_per_file to handle an uneven division of the ranks. We then get our new rank within our color. Each communication group on the right side of figure 16.4 has 4,096 processes or ranks. The order of the ranks is the same as in the global communication group. If there is more than one file, the filenames include a color number (lines 59-64). This code currently sets only one color and writes only one file, as shown on the left side of figure 16.4, but it is written to support more files.
We also need to know where the starting x and y values are for each process. For data decompositions that have the same number of rows and columns for each process, the calculation only needs to know the location of the process in the global set. But when the number of rows and columns varies across processes, we need to sum all the sizes below our position. As we discussed in section 5.6, this operation is a common parallel pattern called a scan. To do this calculation, in lines 26-29 we create communicators for each row and column. These are used in the exclusive scan operations on lines 36-37 to get the starting location in x and y for each process. In this code, we only partition the data in the x-coordinate direction to keep it a little simpler. The global and per-process sizes of the array subdomains are set in lines 32-44. This includes the data offsets calculated using the exclusive scans.
Now that we have all the necessary information about the data decomposition, we can call our mpi_io_file_init subroutine on line 56 to set up the MPI data types for the memory and filesystem layout. This only has to be done once, at startup. We are then free to call our subroutines for writes, write_mpi_io_file, and reads, read_mpi_io_file, on lines 67 and 69. We can call these as many times as needed during the run. In our example code, we then verify the data read in, compare it to the original data, and print an error if one occurs. Finally, we open the file on a single process and use a standard C binary read to show how the data is laid out in the file. This is done by reading each value from the file in sequential order and printing it out.
Now to compile and run the example. The build is a standard CMake build, and we’ll run it on four processors.
mkdir build && cd build
cmake ..
make
mpirun -n 4 ./mpi_io_block2d
Figure 16.5 shows the output from a standard C binary read for the 10×10 grid on each processor.
Figure 16.5 Output from a small binary read code for the MPI-IO shows what the file contains. With MPI-IO, we had to write a small utility to check the file contents.
With traditional data file formats, the data is meaningless without the code that is used to write and read the file. The Hierarchical Data Format (HDF), version 5, takes a different approach. HDF5 provides a self-describing parallel data format. HDF5 is called self-describing because the name and characteristics are stored in the file with the data. In HDF5, with the description of the data contained in the file, you no longer need the source code and can read the data by just querying the file. HDF5 also has a rich set of command-line utilities (such as h5ls and h5dump) that you can use to query the contents of a file. You will find that the utilities are useful when checking that your files are properly written.
We want to write data in binary format for speed and precision. But because it is in binary format, it is difficult to check whether the data is correctly written. If we read the data back in, the problem could be in the reading process as well. A utility that can query the file provides a way to check the write operation separately from the read. In figure 16.5 in the previous section on MPI-IO, we needed a small program to read the contents of the file. For HDF5, that is unnecessary because the utility is already provided. In figure 16.6 (shown later in this section), we use the h5dump command-line utility to look at the contents. You can avoid the need to write code for many common operations by using the existing HDF5 utilities.
Parallel HDF5 is implemented using MPI-IO. Because it is built on MPI-IO, the structure of HDF5 is similar. Although similar, the terminology and individual function calls differ enough to cause some difficulty. We'll cover the functions needed to write a parallel file-handling routine like the one we wrote for MPI-IO. The HDF5 library is divided into lower-level functionality groupings. These functional groups are conveniently distinguished by the prefix shared by all of the calls in the group. The first group is the obligatory file-handling operations (table 16.4) that collectively handle file open and close.
Table 16.4 HDF5 collective file routines
Collective file open that will create the file if it doesn't exist
Next, we need to define new memory types. These are used to specify the portions of data to write and their layout. In HDF5, these memory types are called dataspaces. The dataspace operations in table 16.5 include ways to extract patterns from a multidimensional array. You can find information on the many additional routines in the further reading section at the end of the chapter (16.7.1).
Table 16.5 HDF5 dataspace routines
Creates a hyperslab region type of parts of a multidimensional array
There are other dataspace operations, including point-based operations, that we don't cover here. Now we need to apply these dataspaces to a set of multidimensional arrays (table 16.6). In HDF5, such an array is called a dataset, which is generally a multidimensional array or some other form of structured data within the application.
Table 16.6 HDF5 dataset routines
There is only one operation group left that we need. This group, called property lists, gives you a way to modify or supply hints to operations as table 16.7 shows. We can use property lists for setting attributes to use collective operations with reads or writes. Property lists can also be used to pass hints to the underlying MPI-IO library.
Table 16.7 HDF5 property list routines
Let’s move on to an example. We start this HDF5 example with the code to create the file and the memory dataspaces. The following listing shows this process. All the arguments to HDF5 are bolded in the listing.
Listing 16.5 Setting up HDF5 dataspace types
HDF5Examples/hdf5block2d/hdf5_file_ops.c
11 void hdf5_file_init(int ng, int ndims, int ny_global, int nx_global,
12 int ny, int nx, int ny_offset, int nx_offset, MPI_Comm mpi_hdf5_comm,
13 hid_t *memspace, hid_t *filespace){
14 // create data descriptors on disk and in memory
15 *filespace = create_hdf5_filespace(ndims, ❶
ny_global, nx_global, ny, nx, ❶
16 ny_offset, nx_offset, mpi_hdf5_comm); ❶
17 *memspace =
create_hdf5_memspace(ndims, ny, nx, ng); ❷
18 }
19
20 hid_t create_hdf5_filespace(int ndims, int ny_global, int nx_global,
21 int ny, int nx, int ny_offset, int nx_offset,
MPI_Comm mpi_hdf5_comm){
22 // create the dataspace for data stored on disk
// using the hyperslab call
23 hsize_t dims[] = {ny_global, nx_global};
24
25 hid_t filespace = H5Screate_simple(ndims, ❸
dims, NULL); ❸
26
27 // determine the offset into the filespace for the current process
28 hsize_t start[] = {ny_offset, nx_offset};
29 hsize_t stride[] = {1, 1};
30 hsize_t count[] = {ny, nx};
31
32 H5Sselect_hyperslab(filespace, H5S_SELECT_SET, ❹
33 start, stride, count, NULL); ❹
34 return filespace;
35 }
36
37 hid_t create_hdf5_memspace(int ndims, int ny, int nx, int ng) {
38 // create a memory space in memory using the hyperslab call
39 hsize_t dims[] = {ny+2*ng, nx+2*ng};
40
41 hid_t memspace = H5Screate_simple(ndims, dims, NULL); ❺
42
43 // select the real data out of the array
44 hsize_t start[] = {ng, ng};
45 hsize_t stride[] = {1, 1};
46 hsize_t count[] = {ny, nx};
47
48 H5Sselect_hyperslab(memspace, H5S_SELECT_SET, ❻
49 start, stride, count, NULL); ❻
50 return memspace;
51 }
52
53 void hdf5_file_finalize(hid_t *memspace, hid_t *filespace){
54 H5Sclose(*memspace);
55 *memspace = H5S_NULL;
56 H5Sclose(*filespace);
57 *filespace = H5S_NULL;
58 }
❶ Calls the subroutine to create the filespace
❷ Creates the memory dataspace
❸ Creates the filespace object
❹ Selects the filespace hyperslab
❺ Creates the memspace object
❻ Selects the memspace hyperslab
In listing 16.5, we use the same pattern when creating the two dataspaces: create the data object, set the data size arguments, and then select a rectangular region of the array. First, we create the global array space with the H5Screate_simple call. For the file dataspace, we set the dimensions to the global array sizes nx_global and ny_global on line 23 and then use those sizes on line 25 to create the dataspace. We then select a region of the file dataspace for each process with the H5Sselect_hyperslab call on line 32. A similar process is then done for the memory dataspace, with the selection on line 48.
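The start, stride, and count arguments to H5Sselect_hyperslab can be confusing at first. The following sketch mirrors the selection arithmetic in plain C so you can see which flattened, row-major global indices a given hyperslab picks out. This is an illustration of the semantics only, not HDF5 library code; the function name and the example sizes are our own.

```c
// Compute the flattened (row-major) global indices selected by a 2D
// hyperslab with the given start, stride, and count -- the same three
// array arguments passed to H5Sselect_hyperslab in listing 16.5.
// Returns the number of selected elements written into out[].
int hyperslab_indices(int nx_global, const int start[2],
                      const int stride[2], const int count[2], int *out) {
    int n = 0;
    for (int j = 0; j < count[0]; j++) {      // selected rows
        for (int i = 0; i < count[1]; i++) {  // selected columns
            int gy = start[0] + j * stride[0];
            int gx = start[1] + i * stride[1];
            out[n++] = gy * nx_global + gx;   // row-major flattening
        }
    }
    return n;
}
```

For a 4 x 4 global array, a rank with start = {2, 0}, stride = {1, 1}, and count = {2, 4} selects the flattened indices 8 through 15, that is, the bottom two rows of the global array.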
Now that we have the dataspaces, the process of writing out the data into the file is straightforward. We open the file, create the dataset, and write it. If there are more datasets, we continue to write these out, and when finished, we close the file. The following listing shows how this is done.
Listing 16.6 Writing to an HDF5 file
HDF5Examples/hdf5block2d/hdf5_file_ops.c
60 void write_hdf5_file(const char *filename, double **data1,
61 hid_t memspace, hid_t filespace, MPI_Comm mpi_hdf5_comm) {
62 hid_t file_identifier = create_hdf5_file( ❶
filename, mpi_hdf5_comm); ❶
63
64 // Create property list for collective dataset write.
65 hid_t xfer_plist = H5Pcreate(H5P_DATASET_XFER);
66 H5Pset_dxpl_mpio(xfer_plist, H5FD_MPIO_COLLECTIVE);
67
68 hid_t dataset1 = create_hdf5_dataset( ❷
file_identifier, filespace); ❷
69 //hid_t dataset2 = create_hdf5_dataset(file_identifier, filespace);
70
71 // write the data to disk using both the memory space
// and the data space.
72 H5Dwrite(dataset1, H5T_IEEE_F64LE, ❸
memspace, filespace, xfer_plist, ❸
73 &(data1[0][0])); ❸
74 //H5Dwrite(dataset2, H5T_IEEE_F64LE,
// memspace, filespace, xfer_plist,
75 // &(data2[0][0]));
76
77 H5Dclose(dataset1);
78 //H5Dclose(dataset2);
79
80 H5Pclose(xfer_plist);
81
82 H5Fclose(file_identifier); ❹
83 }
84
85 hid_t create_hdf5_file(const char *filename, MPI_Comm mpi_hdf5_comm){
86 hid_t file_creation_plist = H5P_DEFAULT; ❺
87 // set the file access template for parallel IO access
88 hid_t file_access_plist = H5P_DEFAULT; ❻
89 file_access_plist = H5Pcreate(H5P_FILE_ACCESS);
90
91 // set collective mode for metadata writes
92 H5Pset_coll_metadata_write(file_access_plist, true);
93
94 MPI_Info mpi_info = MPI_INFO_NULL; ❼
95 MPI_Info_create(&mpi_info);
96 MPI_Info_set(mpi_info, "striping_factor", "8");
97 MPI_Info_set(mpi_info, "striping_unit", "4194304");
98
99 // tell the HDF5 library that we want to use MPI-IO to do the writing
100 H5Pset_fapl_mpio(file_access_plist, mpi_hdf5_comm, mpi_info);
101
102 // Open the file collectively
103 // H5F_ACC_TRUNC - overwrite existing file.
// H5F_ACC_EXCL - no overwrite
104 // 3rd argument is file creation property list. Using default here
105 // 4th argument is the file access property list identifier
106 hid_t file_identifier = H5Fcreate(filename, ❽
107 H5F_ACC_TRUNC, file_creation_plist, ❽
file_access_plist); ❽
108
109 // release the file access template
110 H5Pclose(file_access_plist);
111 MPI_Info_free(&mpi_info);
112
113 return file_identifier;
114 }
115
116 hid_t create_hdf5_dataset(hid_t file_identifier, hid_t filespace){
117 // create the dataset
118 hid_t link_creation_plist = H5P_DEFAULT; ❾
119 hid_t dataset_creation_plist = H5P_DEFAULT; ❿
120 hid_t dataset_access_plist = H5P_DEFAULT; ⓫
121 hid_t dataset = H5Dcreate2( ⓬
122 file_identifier, // Arg 1: file identifier
123 "data array", // Arg 2: dataset name
124 H5T_IEEE_F64LE, // Arg 3: datatype identifier
125 filespace, // Arg 4: filespace identifier
126 link_creation_plist, // Arg 5: link creation property list
127 dataset_creation_plist, // Arg 6: dataset creation property list
128 dataset_access_plist); // Arg 7: dataset access property list
129
130 return dataset;
131 }
❶ Calls the subroutine to create the file
❷ Calls the subroutine to create the dataset
❸ Writes the data using the memory and file dataspaces
❹ Closes the objects and the data file
❺ Creates file creation property list
❻ Creates file access property list
❼ Creates MPI info object to pass hints to MPI-IO
❽ HDF5 routine creates the file.
❾ Creates the link creation property list
❿ Creates the dataset creation property list
⓫ Creates the dataset access property list
⓬ HDF5 routine creates the dataset.
In listing 16.6, the main write_hdf5_file routine uses the memspace and filespace dataspaces that we created in listing 16.5. We write out the dataset with the H5Dwrite routine on line 72, using both dataspaces. We also create and pass in a property list to tell HDF5 to use collective MPI-IO routines. Finally, on line 82, we close the file; the property list and dataset are closed on the preceding lines to avoid memory leaks. For the routine that creates the file, we ultimately call H5Fcreate on line 106, but we need several lines before it to set up the hints. We wrap the property-list setup for the collective write and the MPI-IO hints, together with the call itself, in a separate routine. We take the same approach with the HDF5 call on line 121 that creates the dataset so that we can detail the different property lists you can use.
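Note that H5Dwrite on line 72 receives &(data1[0][0]), which only works if the 2D array occupies one contiguous block of memory. A hedged sketch of such an allocator follows; the book's actual allocation helper is not shown in this section, so this is an assumed equivalent of that pattern.

```c
#include <stdlib.h>

// Allocate a 2D array as row pointers over one contiguous block of
// memory. Passing &(data[0][0]) to H5Dwrite, as in listing 16.6, is
// only valid with contiguous storage like this; an array of
// separately malloc'ed rows would not work.
double **malloc2d_contiguous(int ny, int nx) {
    double **array = (double **)malloc(ny * sizeof(double *));
    array[0] = (double *)calloc((size_t)ny * nx, sizeof(double));
    for (int j = 1; j < ny; j++)
        array[j] = array[0] + (size_t)j * nx;
    return array;
}

void free2d_contiguous(double **array) {
    free(array[0]);  // frees the whole contiguous data block
    free(array);     // frees the row pointers
}
```

Because the rows share one block, &array[1][0] is exactly nx elements past &array[0][0], and a single H5Dwrite call can stream the entire array.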
The routine to read the HDF5 data file, shown in the following listing, has the same basic pattern as the earlier write operation. The biggest difference between this listing and listing 16.6 is that there are fewer hints and attributes needed.
Listing 16.7 Reading an HDF5 file
HDF5Examples/hdf5block2d/hdf5_file_ops.c
135 void read_hdf5_file(const char *filename, double **data1,
136 hid_t memspace, hid_t filespace, MPI_Comm mpi_hdf5_comm) {
137 hid_t file_identifier = ❶
open_hdf5_file(filename, mpi_hdf5_comm); ❶
138
139 // Create property list for collective dataset read.
140 hid_t xfer_plist = H5Pcreate(H5P_DATASET_XFER);
141 H5Pset_dxpl_mpio(xfer_plist, H5FD_MPIO_COLLECTIVE);
142
143 hid_t dataset1 =
open_hdf5_dataset(file_identifier); ❷
144 // read the data from disk using both the memory space
// and the data space.
145 H5Dread(dataset1, H5T_IEEE_F64LE, memspace, ❸
146 filespace, H5P_DEFAULT, &(data1[0][0])); ❸
147 H5Dclose(dataset1);
148
149 H5Pclose(xfer_plist);
150
151 H5Fclose(file_identifier); ❹
152 }
153
154 hid_t open_hdf5_file(const char *filename, MPI_Comm mpi_hdf5_comm){
155 // set the file access template for parallel IO access
156 hid_t file_access_plist = H5P_DEFAULT; // File access property list
157 file_access_plist = H5Pcreate(H5P_FILE_ACCESS);
158
159 // set collective mode for metadata reads (ops)
160 H5Pset_all_coll_metadata_ops(file_access_plist, true);
161
162 // tell the HDF5 library that we want to use MPI-IO to do the reading
163 H5Pset_fapl_mpio(file_access_plist, mpi_hdf5_comm, MPI_INFO_NULL);
164
165 // Open the file collectively
166 // H5F_ACC_RDONLY - sets access to read or write
// on open of an existing file.
167 // 3rd argument is the file access property list identifier
168 hid_t file_identifier = H5Fopen(filename, ❺
H5F_ACC_RDONLY, file_access_plist); ❺
169
170 // release the file access template
171 H5Pclose(file_access_plist);
172
173 return file_identifier;
174 }
175
176 hid_t open_hdf5_dataset(hid_t file_identifier){
177 // open the dataset
178 hid_t dataset_access_plist = H5P_DEFAULT; ❻
179 hid_t dataset = H5Dopen2( ❼
180 file_identifier, // Arg 1: file identifier
181 "data array", // Arg 2: dataset name to match for read
182 dataset_access_plist); // Arg 3: dataset access property list
183
184 return dataset;
185 }
❶ Calls the subroutine to open the file
❷ Calls the subroutine to open the dataset
❸ Reads the data using the memory and file dataspaces
❹ Closes the objects and the data file
❺ HDF5 routine opens the file.
❻ Creates dataset access property list
❼ HDF5 routine opens the dataset.
Because the file already exists, we use an open call on line 168 in listing 16.7 to specify read-only mode. (Using read-only mode allows additional optimizations.) The accessed file already has some attributes that were specified during the write. Some of these attributes do not need to be specified in the read. The HDF5 listings so far might comprise a general-purpose library within an application. The next listing shows the calls that would be placed at different points in the main application.
Listing 16.8 Main application file
HDF5Examples/hdf5block2d/hdf5block2d.c
52 hid_t memspace = H5S_NULL, filespace = H5S_NULL;
53 hdf5_file_init(ng, ndims, ny_global, ❶
       nx_global, ny, nx, ny_offset, nx_offset, ❶
54     mpi_hdf5_comm, &memspace, &filespace); ❶
55
56 char filename[30];
57 if (ncolors > 1) {
58    sprintf(filename,"example_%02d.hdf5",color);
59 } else {
60    sprintf(filename,"example.hdf5");
61 }
62
63 // Do the computation and write out a sequence of files
64 write_hdf5_file(filename, data, memspace, ❷
       filespace, mpi_hdf5_comm); ❷
65 // Read back the data for verifying the file operations
66 read_hdf5_file(filename, data_restore, ❸
       memspace, filespace, mpi_hdf5_comm); ❸
67
68 hdf5_file_finalize(&memspace, &filespace); ❹
❶ Sets up the memory and file dataspaces
❷ Writes the data to the HDF5 data file
❸ Reads in the data from the HDF5 data file
❹ Releases the memory and file dataspaces
In listing 16.8, the initialization to set up the dataspaces on line 53 can be done once, at the start of your program. You can then write out the data at periodic intervals for graphics and checkpointing. The read would typically be done when restarting from a checkpoint at the start of a run. Lastly, the finalize call should be made at the end of the program before terminating the calculation. Now let's compile and run the example. The build is a standard CMake build, and we'll run it on four processors:
mkdir build && cd build
cmake ..
make
mpirun -n 4 ./hdf5block2d
In a single install, the HDF5 package can be installed as either a parallel or a serial version, but not both. A common problem is linking the wrong version into your application. We added some special code to the CMake build system to preferentially select a parallel version, as the next listing shows. The configuration fails if the HDF5 version is not parallel, so that we catch the problem immediately rather than getting errors later in the build or at run time.
Listing 16.9 Checking for a parallel HDF5 package
HDF5Examples/hdf5block2d/CMakeLists.txt
14 set(HDF5_PREFER_PARALLEL true)
15 find_package(HDF5 1.10.1 REQUIRED)
16 if (NOT HDF5_IS_PARALLEL)
17    message(FATAL_ERROR " -- HDF5 version is not parallel.")
18 endif (NOT HDF5_IS_PARALLEL)
The example code does a verification test to check that the data read back from the file is the same as the data that we started with. We can also use the h5dump utility to print the data in the file. You can use the following command to look at your data file. Figure 16.6 shows the output from the command.
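A verification test of this sort can be as simple as an element-by-element comparison of the original and restored arrays. The sketch below is our own minimal version, not the example code's actual test; a zero tolerance is appropriate here because a binary round trip should reproduce the values exactly.

```c
#include <math.h>

// Compare restored data against the original, element by element.
// Returns the number of mismatches; zero means the write/read
// round trip preserved the data.
int verify_data(double **data, double **data_restore, int ny, int nx,
                double tol) {
    int mismatches = 0;
    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++)
            if (fabs(data[j][i] - data_restore[j][i]) > tol)
                mismatches++;
    return mismatches;
}
```

Calling verify_data(data, data_restore, ny, nx, 0.0) after the read in listing 16.8 should report zero mismatches.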
h5dump -y example.hdf5
Figure 16.6 Using the h5dump command-line utility shows what is contained in the HDF5 file without having to write any code.
In this section, we briefly cover a couple of the more common parallel file software packages: PnetCDF and Adios. PnetCDF, short for Parallel Network Common Data Form, is another self-describing data format that is popular in the Earth Systems community and among organizations funded by the National Science Foundation (NSF). While originally a completely separate software source, the parallel version is built on top of HDF5 and MPI-IO. The decision of whether to use PnetCDF or HDF5 is strongly influenced by your community. Because the files generated by your application are often used by others, using the same data standard is important.
ADIOS, or the Adaptable Input/Output System, is also a self-describing data format from Oak Ridge National Laboratory (ORNL). ADIOS has its own native binary format, but it can also use HDF5, MPI-IO, and other file-storage software.
With increasing data demands, more complex filesystems become necessary. In this section, we introduce these parallel filesystems. A parallel filesystem can greatly speed up file writes and reads by spreading the operations across several hard disks with multiple file writers or readers. While this gives us some parallelism at the filesystem level, it is not a simple situation. There is still a mismatch between the application parallelism and the parallelism provided by the filesystem. Because of this, the management of the parallel operations is complex and highly dependent on the hardware configuration and application demands. To deal with the complexity, many parallel filesystems use an object-based file structure. Object-based filesystems are a natural fit for these challenges, but the performance and robustness of a parallel filesystem is often limited by the metadata describing the locations of the file data.
Definition An object-based filesystem is a system organized around objects rather than files in folders. An object-based filesystem requires a database, or metadata, to store all the information describing the objects.
The writing of parallel file operations is highly intertwined with the parallel filesystem software. This requires knowing which parallel filesystem is being used and the settings available for that installation and filesystem. Tuning your parallel file software can sometimes yield significant performance gains.
As you get into the interaction of the parallel file operations with the filesystem, it is helpful to see more information about the parallel library settings. The settings can be set differently for each installation. You can also get some high-level statistics that can help with debugging performance issues.
Most MPI-IO libraries are based on one of two implementations: ROMIO, which is distributed with MPICH and many system vendor implementations, or OMPIO, which is the default in newer versions of OpenMPI. Let's first go over how to get information from OpenMPI's OMPIO plugin, and how to switch back to using ROMIO. To extract information on OpenMPI's OMPIO settings, use the following commands:
Specifies the IO plugin, either OMPIO or ROMIO. Older releases use ROMIO as the default plugin, while OMPIO is the default on newer releases.
Displays information on the local OpenMPI configuration for that plugin.
First, you can get the names of the IO plugins with the ompi_info command. We just want the IO component plugins, so we filter the output for these:
ompi_info | grep "MCA io:"
  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.0.3)
  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.0.3)
Then you can get the individual settings available for each plugin. Using the ompi_info command, we get the following abbreviated output:
ompi_info --param io ompio --level 9 | grep ": parameter"
  MCA io ompio: parameter "io_ompio_priority" (current value: "30" ...
  MCA io ompio: parameter "io_ompio_delete_priority" (current value: "30" ...
  MCA io ompio: parameter "io_ompio_record_file_offset_info" (current value: "0" ...
  MCA io ompio: parameter "io_ompio_coll_timing_info" (current value: "1" ...
  MCA io ompio: parameter "io_ompio_cycle_buffer_size" (current value: "536870912" ...
  MCA io ompio: parameter "io_ompio_bytes_per_agg" (current value: "33554432" ...
  MCA io ompio: parameter "io_ompio_num_aggregators" (current value: "-1" ...
  MCA io ompio: parameter "io_ompio_grouping_option" (current value: "5" ...
  MCA io ompio: parameter "io_ompio_max_aggregators_ratio" (current value: "8" ...
  MCA io ompio: parameter "io_ompio_aggregators_cutoff_threshold" (current value: "3" ...
  MCA io ompio: parameter "io_ompio_overwrite_amode" (current value: "1" ...
  MCA io ompio: parameter "io_ompio_verbose_info_parsing" (current value: "0" ...
You can also verify how the MPI_Info_set calls are interpreted by the MPI-IO library with the following run-time option. This can be a good way to check that your code is correctly written for your filesystem and parallel file operation libraries.
mpirun --mca io_ompio_verbose_info_parsing 1 -n 4 ./mpi_io_block2d
File: example.data
info: collective_buffering value true enforcing using individual fcoll component
  < ... repeated three more times ... >
For the ROMIO parallel file software included with MPICH, there are different mechanisms to query the software installation. Cray adds some additional environment variables for its implementation of ROMIO. We’ll list some of these and then look at examples that use them.
The following shows the output when using ROMIO_PRINT_HINTS:
export ROMIO_PRINT_HINTS=1; mpirun -n 4 ./mpi_io_block2d
key = cb_buffer_size value = 16777216
key = romio_cb_read value = automatic
key = romio_cb_write value = automatic
key = cb_nodes value = 1
key = romio_no_indep_rw value = false
key = romio_cb_pfr value = disable
key = romio_cb_fr_types value = aar
key = romio_cb_fr_alignment value = 1
key = romio_cb_ds_threshold value = 0
key = romio_cb_alltoall value = automatic
key = ind_rd_buffer_size value = 4194304
key = ind_wr_buffer_size value = 524288
key = romio_ds_read value = automatic
key = romio_ds_write value = automatic
key = striping_unit value = 4194304
key = cb_config_list value = *:1
key = romio_filesystem_type value = NFS:
key = romio_aggregator_list value = 0
key = cb_buffer_size value = 16777216
key = romio_cb_read value = automatic
key = romio_cb_write value = automatic
key = cb_nodes value = 1
key = romio_no_indep_rw value = false
key = romio_cb_pfr value = disable
key = romio_cb_fr_types value = aar
key = romio_cb_fr_alignment value = 1
key = romio_cb_ds_threshold value = 0
key = romio_cb_alltoall value = automatic
key = ind_rd_buffer_size value = 4194304
key = ind_wr_buffer_size value = 524288
key = romio_ds_read value = automatic
key = romio_ds_write value = automatic
key = cb_config_list value = *:1
key = romio_filesystem_type value = NFS:
key = romio_aggregator_list value = 0
export MPICH_MPIIO_HINTS_DISPLAY=1; srun -n 4 ./mpi_io_block2d
PE 0: MPICH MPIIO environment settings:
PE 0: MPICH_MPIIO_HINTS_DISPLAY = 1
PE 0: MPICH_MPIIO_HINTS = NULL
PE 0: MPICH_MPIIO_ABORT_ON_RW_ERROR = disable
PE 0: MPICH_MPIIO_CB_ALIGN = 2
PE 0: MPICH_MPIIO_DVS_MAXNODES = -1
PE 0: MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY = 0
PE 0: MPICH_MPIIO_AGGREGATOR_PLACEMENT_STRIDE = -1
PE 0: MPICH_MPIIO_MAX_NUM_IRECV = 50
PE 0: MPICH_MPIIO_MAX_NUM_ISEND = 50
PE 0: MPICH_MPIIO_MAX_SIZE_ISEND = 10485760
PE 0: MPICH MPIIO statistics environment settings:
PE 0: MPICH_MPIIO_STATS = 0
PE 0: MPICH_MPIIO_TIMERS = 0
PE 0: MPICH_MPIIO_WRITE_EXIT_BARRIER = 1
MPIIO WARNING: DVS stripe width of 8 was requested but DVS set it to 1
See MPICH_MPIIO_DVS_MAXNODES in the intro_mpi man page.
PE 0: MPIIO hints for example.data:
cb_buffer_size = 16777216
romio_cb_read = automatic
romio_cb_write = automatic
cb_nodes = 1
cb_align = 2
romio_no_indep_rw = false
romio_cb_pfr = disable
romio_cb_fr_types = aar
romio_cb_fr_alignment = 1
romio_cb_ds_threshold = 0
romio_cb_alltoall = automatic
ind_rd_buffer_size = 4194304
ind_wr_buffer_size = 524288
romio_ds_read = disable
romio_ds_write = automatic
striping_factor = 1
striping_unit = 4194304
direct_io = false
aggregator_placement_stride = -1
abort_on_rw_error = disable
cb_config_list = *:*
romio_filesystem_type = CRAY ADIO:
export MPICH_MPIIO_STATS=1; srun -n 4 ./mpi_io_block2d
+--------------------------------------------------------+
| MPIIO write access patterns for example.data
| independent writes = 0
| collective writes = 4
| independent writers = 0
| aggregators = 1
| stripe count = 1
| stripe size = 4194304
| system writes = 2
| stripe sized writes = 0
| aggregators active = 4,0,0,0 (1, <= 1, > 1, 1)
| total bytes for writes = 3600
| ave system write size = 1800
| read-modify-write count = 0
| read-modify-write bytes = 0
| number of write gaps = 0
| ave write gap size = NA
| See "Optimizing MPI I/O on Cray XE Systems" S-0013-20 for explanations.
+--------------------------------------------------------+
+--------------------------------------------------------+
| MPIIO read access patterns for example.data
| independent reads = 0
| collective reads = 4
| independent readers = 0
| aggregators = 1
| stripe count = 1
| stripe size = 524288
| system reads = 1
| stripe sized reads = 0
| total bytes for reads = 3200
| ave system read size = 3200
| number of read gaps = 0
| ave read gap size = NA
| See "Optimizing MPI I/O on Cray XE Systems" S-0013-20 for explanations.
+--------------------------------------------------------+
It is sometimes useful to give the MPI-IO library hints about the type of file operations your application will use. You can modify the parallel file settings with environment variables, a hints file, or at run time with MPI_Info_set. The environment-variable and hints-file approaches are handy when you don’t have access to the program source to add MPI_Info_set calls. To set parallel file options in this case, use the following commands:
MPICH_MPIIO_HINTS="*:<key>=<value>:<key>=<value>"

For example:

export MPICH_MPIIO_HINTS="*:striping_factor=8:striping_unit=4194304"
ROMIO_HINTS=<filename>
For example: ROMIO_HINTS=romio-hints
where the romio-hints file includes
striping_factor 8      // file is broken into 8 parts and
                       // is written in parallel to 8 disks
striping_unit 4194304  // the size in bytes of each
                       // block to be written
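Putting this together, a shell sketch of creating the hints file and pointing ROMIO at it might look like the following (the filename romio-hints matches the example above; the values are illustrative):

```shell
# Write a ROMIO hints file with the striping settings from the example above.
cat > romio-hints <<'EOF'
striping_factor 8
striping_unit 4194304
EOF

# Point ROMIO at the hints file; the MPI job is then launched as usual,
# for example: mpirun -n 4 ./mpi_io_block2d
export ROMIO_HINTS=romio-hints
```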
OMPI_MCA_<param_name> <value>
For example: export OMPI_MCA_io_ompio_verbose_info_parsing=1
The OpenMPI mca run-time option as an argument to the mpirun command is
mpirun --mca io_ompio_verbose_info_parsing 1 -n 4 <exec>
The default location of the OpenMPI file is in $HOME/.openmpi/mca-params.conf or it can be set with the following:
--tune <filename>

For example: mpirun --tune mca-params.conf -n 2 <exec>
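As a sketch, an mca-params.conf file could be created and passed with --tune as follows (the parameter name comes from the earlier ompi_info output; adjust it for your OpenMPI build):

```shell
# Write an MCA parameter file; one "name = value" pair per line.
cat > mca-params.conf <<'EOF'
io_ompio_verbose_info_parsing = 1
EOF

# The file would then be passed at launch, for example:
#   mpirun --tune mca-params.conf -n 2 <exec>
```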
The most important hint that you can set is whether to use collective operations or data sieving. We’ll first look at the collective operations and then the data sieving operations.
Collective operations harness MPI collective communication calls and use a two-phase I/O approach in which aggregators collect the data and then write it to or read it from your file. Use the following commands for collective I/O:
romio_cb_read=[enable|automatic|disable] specifies when to use collective buffering for reads.
romio_cb_write=[enable|automatic|disable] specifies when to use collective buffering for writes.
cb_config_list=*:<integer> sets the number of aggregators per node.
romio_no_indep_rw=[true|false] specifies whether to use any independent I/O. If none are allowed, no file operations (including file open) will be done on non-aggregator nodes.
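For example, the collective-buffering hints above can be passed through the MPICH_MPIIO_HINTS variable described earlier; this sketch forces collective buffering for all files (whether it helps is workload- and system-dependent):

```shell
# "*" applies the hints to every file the application opens.
export MPICH_MPIIO_HINTS="*:romio_cb_read=enable:romio_cb_write=enable"
echo "$MPICH_MPIIO_HINTS"
# Launch as usual afterwards, e.g.: srun -n 4 ./mpi_io_block2d
```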
Data sieving does a single read (or write) spanning a file block and then parcels out the data to the individual processes. This avoids a lot of smaller reads and the contention between file readers that might otherwise occur. Use the following hints for data sieving with ROMIO:
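The data-sieving hint keys appear in the ROMIO hints output shown earlier: romio_ds_read and romio_ds_write control when sieving is used, and ind_rd_buffer_size and ind_wr_buffer_size size the sieving buffers. A hedged sketch of enabling data sieving for reads:

```shell
# Enable data sieving for reads with a 4 MiB sieving buffer (illustrative value).
export MPICH_MPIIO_HINTS="*:romio_ds_read=enable:ind_rd_buffer_size=4194304"
echo "$MPICH_MPIIO_HINTS"
```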
Some hints only apply to a particular filesystem, such as Lustre or GPFS. We can detect the filesystem type from within our program and set the appropriate hints for that filesystem. The fs_detect.c program in the examples does this. The program uses the statfs function, as the next listing shows; you can find it in the examples directory for this chapter.
Listing 16.10 Filesystem detection program
MPI_IO_Examples/mpi_io_block2d/fs_detect.c

 1 #include <stdio.h>
 2 #ifdef __APPLE_CC__
 3 #include <sys/mount.h>
 4 #else
 5 #include <sys/statfs.h>
 6 #endif
 7 // Filesystem types are listed in the system include directory in linux/magic.h
 8 // You will need to add any additional parallel filesystem magic codes
 9 #define LUSTRE_MAGIC1 0x858458f6                      ❶
10 #define LUSTRE_MAGIC2 0xbd00bd0                       ❶
11 #define GPFS_SUPER_MAGIC 0x47504653                   ❶
12 #define PVFS2_SUPER_MAGIC 0x20030528                  ❶
13 #define PAN_KERNEL_FS_CLIENT_SUPER_MAGIC 0xAAD7AAEA   ❶
14
15 int main(int argc, char *argv[])
16 {
17    struct statfs buf;
18    statfs("./fs_detect", &buf);   ❷
19    printf("File system type is %lx\n", buf.f_type);
20 }
❶ Magic numbers for parallel filesystem types
We included the magic numbers for some of the parallel filesystems in this listing. When using this for other applications, replace the filename on line 18 with an appropriate filename for the directory where your files are written. Build the fs_detect program and then run the following command to get the filesystem type:
mkdir build && cd build
cmake ..
make
grep `./fs_detect | cut -f 5 -d' '` /usr/include/linux/magic.h ../fs_detect.c
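Once fs_detect prints the magic number, a small helper can translate it into a filesystem name. Here is a hedged shell sketch using the values from listing 16.10, plus NFS_SUPER_MAGIC (0x6969) from linux/magic.h; note that fs_detect prints lowercase hex without the 0x prefix:

```shell
# Map a statfs f_type magic number (lowercase hex, no 0x prefix) to a name.
# Values match the #defines in listing 16.10; 6969 is NFS_SUPER_MAGIC.
fs_name() {
  case "$1" in
    858458f6|bd00bd0) echo "Lustre" ;;
    47504653)         echo "GPFS" ;;
    20030528)         echo "PVFS2/OrangeFS" ;;
    aad7aaea)         echo "Panasas" ;;
    6969)             echo "NFS" ;;
    *)                echo "unknown" ;;
  esac
}

fs_name 47504653   # prints "GPFS"
```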
Now we are ready for the filesystem-specific hints. We don’t list all the possible hints. You can get the current list by using the commands previously shown.
Lustre filesystem: The most common filesystem in high performance computing centers
Lustre is the dominant filesystem on the largest high performance computing systems. Originating at Carnegie Mellon University, its primary development and ownership have passed through Intel, HP, Sun, Oracle, Intel, Whamcloud, and others. In the process, it has gone from commercial to open source and back. Currently it is under the Open Scalable File Systems (OpenSFS) and European Open File Systems (EOFS) banners.
Lustre is built on the concept of object storage with Object Storage Servers (OSSs) and Object Storage Targets (OSTs). When we specify a striping_factor of 8 on line 56 of listing 16.2 and line 96 of listing 16.6, we are telling the ROMIO library to use Lustre to break up the writes (and reads) into eight pieces and send them to eight OSTs, effectively writing out the data in eight-way parallelism. The striping_unit hint tells ROMIO and Lustre to use 4 MiB stripe sizes. Lustre also has Metadata Servers (MDS) and Metadata Targets (MDT) to store the critical descriptions of where each part of the file is stored. For striping operations, use the following:
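For instance, the same striping_factor and striping_unit keys shown earlier can be applied from the shell when you cannot edit the source (the values are illustrative: eight OSTs with 4 MiB stripes):

```shell
# Request 8-way striping with 4 MiB stripes for all files the job opens.
export MPICH_MPIIO_HINTS="*:striping_factor=8:striping_unit=4194304"
echo "$MPICH_MPIIO_HINTS"
```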
We can confirm the Lustre parameters for OpenMPI with a command-line query:
ompi_info --param fs lustre --level 9
    MCA fs lustre: parameter "fs_lustre_priority" (current value: "20" ...
    MCA fs lustre: parameter "fs_lustre_stripe_size" (current value: "0" ...
    MCA fs lustre: parameter "fs_lustre_stripe_width" (current value: "0" ...
IBM systems have the General Parallel File System (GPFS), also part of their Spectrum Scale product, that offers striping and parallel file operations on their systems. GPFS is an enterprise storage product with the corresponding support infrastructure and services. GPFS stripes across all available devices by default. The MPI hints may not have as much effect on this filesystem, however. For MPICH (ROMIO), use this command to help with large memory writes/reads:
IBM_largeblock_io=true
DataWarp: A filesystem from Cray
Cray’s DataWarp integrates burst buffer hardware on top of another parallel filesystem, such as their version of Lustre. Taking advantage of burst buffers is still in its infancy, but Cray has been a leader in this effort.
Panasas®: A commercial filesystem requiring fewer hints from users
Panasas® is a commercial parallel filesystem that is composed of object storage and metadata servers. Panasas has also contributed to the extension of the Network File System (NFS) to support parallel operations. Panasas was used in some of the top-ten computing systems at LANL, although it is not so prevalent there today. For MPICH (ROMIO), use these commands to set the stripe size and the number of stripes, respectively:
OrangeFS (PVFS): The most popular open-source filesystem
OrangeFS, previously known as the Parallel Virtual File System (PVFS), is an open source parallel filesystem from Clemson University and Argonne National Laboratory. It is popular on Beowulf clusters. Besides being a scalable parallel filesystem, OrangeFS has been integrated into the Linux kernel. You can use the following commands for MPICH (ROMIO) to set the stripe size (in bytes) and the number of stripes (with -1 being automatic), respectively:
BeeGFS: A new open source filesystem that is gaining in popularity
BeeGFS, formerly FhGFS, was developed at the Fraunhofer Center for High Performance Computing and is freely available. It is popular because of its open source characteristics.
Distributed Application Object Storage (DAOS): Setting new benchmarks for performance
Intel is developing their new, open source DAOS object-storage technology under the Department of Energy (DOE) FastForward program. DAOS ranks first in the 2020 ISC IO500 supercomputing file-speed list (https://www.vi4io.org). It’s scheduled to be deployed on the Aurora supercomputer, Argonne National Laboratory’s first exascale computing system, in 2021. DAOS is supported in the ROMIO MPI-IO library, available with MPICH, and is portable to other MPI libraries.
WekaIO: A newcomer from the big data community
WekaIO is a fully POSIX-compliant filesystem that provides a large shared namespace with highly optimized performance, low latency, and high bandwidth, and uses the latest solid-state hardware components. WekaIO is an attractive filesystem for applications that require large amounts of high-performing data file manipulation and is popular in the big data community. WekaIO took top honors in the 2019 SC IO500 supercomputing file speed list.
Ceph Filesystem: An open source distributed storage system
Ceph originated at Lawrence Livermore National Laboratory. The development is now led by RedHat for a consortium of industrial partners and has been integrated into the Linux kernel.
Network Filesystem (NFS): The most common network filesystem
NFS is the dominant cluster filesystem for the networks in local organizations. It is not a recommended system for highly parallel file operations, although with the proper settings, it functions correctly.
Much of the current documentation on parallel file operations is in presentations and academic conferences. One of the best conferences is the Parallel Data Systems Workshop (PDSW), held in conjunction with The International Conference for High Performance Computing, Networking, Storage, and Analysis (otherwise known as the yearly Supercomputing Conference).
You can use the microbenchmarks IOR and mdtest to check the best performance of a filesystem. The software is documented at https://ior.readthedocs.io/en/latest/ and hosted by LLNL at https://github.com/hpc/ior.
The addition of the MPI-IO functions to MPI is described in the following text. It remains one of the best descriptions of MPI-IO.
William Gropp, Rajeev Thakur, and Ewing Lusk. Using MPI-2: Advanced Features of the Message Passing Interface (MIT Press, 1999).
There are a couple of good books on writing high performance parallel file operations. We recommend the following:
Prabhat and Quincey Koziol, editors, High Performance Parallel I/O (Chapman and Hall/CRC, 2014).
John M. May, Parallel I/O for High Performance Computing (Morgan Kaufmann, 2001).
The HDF Group maintains the authoritative website on HDF5. You can get more information at
The HDF Group, https://portal.hdfgroup.org/display/HDF5/HDF5.
NetCDF remains popular within certain HPC application segments. You can get more information on this format at the NetCDF site hosted by Unidata. Unidata is one of the University Corporation for Atmospheric Research (UCAR)’s Community Programs (UCP).
Unidata, https://www.unidata.ucar.edu/software/netcdf/.
A parallel version of NetCDF, PnetCDF, was developed by Northwestern University and Argonne National Laboratory independently from Unidata. More information on PnetCDF is at their GitHub documentation site:
Northwestern University and Argonne National Laboratory, https://parallel-netcdf.github.io.
ADIOS is one of the leading parallel file operations libraries maintained by a team led by Oak Ridge National Laboratory (ORNL). To learn more, see their documentation at the following website:
Oak Ridge National Laboratory, https://adios2.readthedocs.io/en/latest/index.html.
Some good presentations on tuning performance for filesystems include
Philippe Wautelet, “Best practices for parallel IO and MPI-IO hints” (CNRS/IDRIS, 2015), http://www.idris.fr/media/docs/docu/idris/idris_patc_hints_proj.pdf.
George Markomanolis, ORNL Spectrum Scale (GPFS), https://www.olcf.ornl.gov/wp-content/uploads/2018/12/spectrum_scale_summit_workshop.pdf.
Check for the hints available on your system using the techniques described in section 16.6.1.
Try the MPI-IO and HDF5 examples on your system with much larger datasets to see what performance you can achieve. Compare that to the IOR microbenchmark for extra credit.
Use the h5ls and h5dump utilities to explore the HDF5 data file created by the HDF5 example.
There is a proper way to handle standard file operations for parallel applications. The simple techniques introduced in this chapter, where all IO is performed from the first processor, are sufficient for modest parallel applications.
The use of MPI-IO is an important building block for parallel file operations. MPI-IO can dramatically speed up the writing and reading of files.
There are advantages of using the self-describing parallel HDF5 software. The HDF5 format can improve how your application manages data while also getting fast file operations.
There are ways to query and set the hints for the parallel file software and filesystem. This can improve your file writing and reading performance on particular systems.
Why a whole chapter on tools and resources? Though we’ve mentioned tools and resources in previous chapters, this chapter further discusses the wide variety and alternatives available to high-performance computing programmers. From version control systems to debugging, the available capabilities, whether commercial or open source, are essential to enable the rapid iterations of parallel application development. Nonetheless, these tools are not mandatory. Having an understanding of and embedding these into your workflow often yields tremendous benefits, far outweighing the time spent learning how to use them.
Tools are an important piece of the high-performance computing development process. Not every tool works on every system; therefore, availability of alternatives is important. In the previous chapters, we wanted to focus on the process and not get bogged down in the details of how to use every possible tool. We chose to present the simplest, most available tool for each need. We also preferred the command-line and text-based tools over the fancy graphical interface tools because using graphics interfaces over slow networks can be difficult or even impossible. Graphical tools also tend to be more vendor- or system-centric and often change. Despite these drawbacks, we include many of these vendor tools in this chapter because they can greatly improve your code development for high-performance computing applications.
Resources such as a wide variety of benchmark applications are valuable because applications don’t come in just one flavor. For these specialized application domains, we need more appropriate benchmarks and mini-apps that explore the best approach for algorithm development and the right programming pattern for each architecture. We strongly recommend that you learn from these resources rather than reinventing the techniques from scratch. For most of the tools, we give brief instructions on installation and where to find some documentation. We also provide more detail in the companion code for this chapter at https://github.com/EssentialsofParallelComputing/Chapter17.
We are strongly vendor-agnostic and stress portability as well. Although we cover a lot of tools, it just isn’t possible to go into detail on all of them. In addition, the rate of change for these tools exceeds that of the rest of the high-performance computing ecosystem. History has shown that the support for good tool development is fickle. Thus, the tools come and go and change ownership more quickly than documentation can be updated.
For a quick reference, table 17.1 provides a summary of the tools we cover in this chapter. These are shown in their corresponding categories to help you find the best tools for your needs. We included a wide variety of tools because there may be only one that works on a particular hardware or operating system or may have specialized capabilities. We have chosen to give more details on some of the simpler, more useful and commonly used tools in the following sections of this chapter as indicated in the table.
Table 17.1 Summary of tools covered in this chapter
Version control for software is one of the most basic of software engineering practices and critically important when developing parallel applications. We covered the role of version control in parallel application development in section 2.1.1. Here, we go into more detail on the various version control systems and their characteristics. Version control systems can be broken down into two major categories, distributed and centralized, as figure 17.1 shows.
Figure 17.1 Selecting a type of version control is dependent on your work pattern. Centralized version control is for when everyone is at a location with access to a single server. Distributed version control gives you a full copy of your repository on your laptop and desktop and allows you to go worldwide and mobile.
In a centralized version control system there is just one central repository. This requires a connection to the repository site to do any operations on the repository. In a distributed version control system, various commands, such as clone, create a duplicate (remote) version of the repository and a checkout of the source. You can commit your changes to your local version of the repository while traveling, then push or merge the changes into the main repository at a later time. No wonder distributed version control systems have gained popularity in recent years. That said, these also come with another layer of complexity.
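The clone/commit/push cycle can be tried in miniature with Git, using a local bare repository in place of the central server (a sketch; assumes git is installed and that the paths are scratch space):

```shell
# A local bare repository stands in for the central server.
git init --bare central.git
git clone central.git work
cd work
git config user.email "dev@example.com"   # placeholder identity
git config user.name  "Dev"
echo "parallel code notes" > notes.txt
git add notes.txt
git commit -m "work recorded in the local clone"   # no server connection needed
git push origin HEAD                               # merge upstream later
cd ..
```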
Many code teams are scattered across the globe or on the move all the time. For them, a distributed version control system makes the most sense. The two most common freely available distributed version control systems are Git and Mercurial. There are several other smaller distributed version control systems as well. All of these implementations support a variety of developer workflows.
Despite claims to be easy to learn, these are complex tools to fully understand and use properly. It would take a full book to cover each of these. Fortunately, there are many web tutorials and books that cover their use. A good starting point for Git resources is the Git SCM site: https://git-scm.com.
Mercurial is a bit simpler and has a cleaner design than Git. Additionally, the Mercurial website has a lot of tutorials to get you started.
Mercurial at https://www.mercurial-scm.org/wiki/Mercurial
Bryan O’Sullivan, Mercurial: The Definitive Guide (O’Reilly Media, 2009), http://hgbook.red-bean.com
There are also some commercial distributed version control systems; Perforce and ClearCase are the best known. With these products, you can get more support, which might be important for your organization.
While there are many centralized version control systems that have been developed over a long history of software configuration management, the two most commonly used today are Concurrent Versions System (CVS) and Subversion (SVN). Both are a bit dated these days as interest has shifted towards distributed version control. If used in the proper way for a centralized repository, however, these are both effective and much simpler to use.
Centralized version control also provides better security for proprietary codes by having only one place where the repository needs to be protected. For this reason, centralized version control is still popular in the corporate environment, where limiting access to the source code history is of paramount importance. CVS has a simple branching operation that works well. There is documentation at the CVS website and a widely available book:
CVS (Free Software Foundation, Inc., 1998) at https://www.nongnu.org/cvs/
Per Cederqvist, Version Management with CVS (Network Theory Ltd, December, 2002), available at various sites online and in print
Subversion was developed as a replacement for CVS. Although it is in many respects an improvement over CVS, its branching function is a bit weaker than that in CVS. There is a good book on Subversion, and development is ongoing:
Ben Collins-Sussman, Brian W. Fitzpatrick, and C. Michael Pilato, Version Control with Subversion (Apache Software Foundation, 2002), http://svnbook.red-bean.com
It is helpful to put internal timers into your application to track performance as you work on it. We show a representative timing routine in listings 17.1 and 17.2 that you can use in C, C++, and Fortran with a Fortran wrapper routine. This routine uses the clock_gettime routine with a CLOCK_MONOTONIC type to avoid problems with clock time adjustments.
Listing 17.1 Timer header file
timer.h
#ifndef TIMER_H
#define TIMER_H
#include <time.h>

void cpu_timer_start1(struct timespec *tstart_cpu);
double cpu_timer_stop1(struct timespec tstart_cpu);
#endif
Listing 17.2 Timer source file
timer.c
#include <time.h>
#include "timer.h"

void cpu_timer_start1(struct timespec *tstart_cpu)
{
   clock_gettime(CLOCK_MONOTONIC, tstart_cpu); ❶
}

double cpu_timer_stop1(struct timespec tstart_cpu)
{
   struct timespec tstop_cpu, tresult;
   clock_gettime(CLOCK_MONOTONIC, &tstop_cpu); ❶
   tresult.tv_sec = tstop_cpu.tv_sec - tstart_cpu.tv_sec;
   tresult.tv_nsec = tstop_cpu.tv_nsec - tstart_cpu.tv_nsec;
   double result = (double)tresult.tv_sec + (double)tresult.tv_nsec*1.0e-9;

   return(result);
}
❶ Calls clock_gettime requesting a monotonic clock
There are other timer implementations that you can use if you need an alternative routine. Portability is one reason you may want another implementation. The clock_gettime routine has been supported on macOS since Sierra (10.12), which has helped with some of the portability issues.
Alternative timer implementations
If you are using C++ with the 2011 standard or later, you can use the high-resolution clock, std::chrono::high_resolution_clock. Here we show a list of alternative timers you can use with portability across C, C++, and Fortran.
The clock_gettime function can be used with two different clock types. Although CLOCK_MONOTONIC is preferred, it is not a required type for the Portable Operating System Interface (POSIX), the standard for portability across operating systems. In the timers directory of the examples that accompany this chapter, we include a version with the CLOCK_REALTIME timer type. The gettimeofday and getrusage functions are widely portable and might work on systems where clock_gettime does not.
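As one possible fallback, the following sketch wraps gettimeofday in the same start/stop pattern as the listings above. The cpu_timer_start2 and cpu_timer_stop2 names are our own, for illustration only; note that resolution drops from nanoseconds to microseconds and, unlike CLOCK_MONOTONIC, gettimeofday can jump if the system clock is adjusted.

```c
#include <sys/time.h>
#include <stddef.h>

/* Fallback timers built on gettimeofday for systems where
 * clock_gettime is unavailable. Same start/stop interface as
 * the listings above, but microsecond (not nanosecond)
 * resolution and no protection against clock adjustments. */
void cpu_timer_start2(struct timeval *tstart_cpu)
{
   gettimeofday(tstart_cpu, NULL);
}

double cpu_timer_stop2(struct timeval tstart_cpu)
{
   struct timeval tstop_cpu;
   gettimeofday(&tstop_cpu, NULL);
   return (double)(tstop_cpu.tv_sec  - tstart_cpu.tv_sec) +
          (double)(tstop_cpu.tv_usec - tstart_cpu.tv_usec)*1.0e-6;
}
```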
A profiler is a programmer tool that measures some aspect of the performance of an application. We covered profiling earlier in sections 2.2 and 3.3 as a key part of the application development process and introduced a couple of the simpler profiling tools. In this section, we cover some alternative profiling tools you might consider for your application development and show you how to use a few more of the simpler profilers. Profilers are important tools in developing parallel applications when:
You want to work on a section of code that has the most impact in improving the performance of your application. This section of code is often referred to as the bottleneck.
You want to measure your performance improvement on various architectures. After all, we are all about performance in high performance computing applications.
Profilers come in a variety of shapes and sizes. We break down our discussion into categories that reflect their broad characteristics. It is important to use a tool from the appropriate category. It is not advisable to use a heavyweight profiling tool when all you want is to find the biggest bottleneck. The wrong tool will bury you in an avalanche of information that will leave you digging yourself out for hours or days. Save the heavyweight tools for when you really need to dive down into the low-level details of your application. We suggest starting with simple profilers and working up to the detailed profilers when needed. Our categories of profilers follow this simple-to-complex hierarchy, with some subjective judgment on where each profiling tool falls in the list.
Table 17.2 Categories of profiling tools (simple to complex)
A top-down profiler that highlights the routines needing improvement, often in a graphical user interface
High-level is not indicative of high-detail; these tools give the 25,000 ft picture of an application’s performance. We find ourselves returning to the simpler profiling tools, such as the simple text-based and high-level profilers because these are quick to use and don’t take up most of the day.
Simple text-based profilers like LIKWID, gprof, gperftools, timemory, and Open|SpeedShop are easy to incorporate into your daily application development workflow. These provide quick insight into performance.
The likwid (Like I Knew What I’m Doing) suite of tools was first introduced in section 3.3.1 and also used in chapters 4, 6, and 9. We used it extensively because of its simplicity. There is ample documentation at the likwid website:
likwid performance tools at https://hpc.fau.de/research/tools/likwid/
The venerable gprof tool has been a mainstay for profiling applications on Linux for many years. We used it in section 13.4.2 for a quick profile of our application. Gprof uses a sampling approach to measure where the application is spending its time. It is a command-line tool that is enabled by adding -pg when compiling and linking your application. Then, when your application runs, it produces a file called gmon.out at completion. The command-line gprof utility then displays the performance data as text output. Gprof comes with most Linux systems and is part of the GCC and Clang/LLVM compilers. Gprof is relatively dated, but is readily available and simple to use. The gprof documentation is fairly simple and is widely available at the following site:
GNU Binutils documentation from The Free Software Foundation at https://sourceware.org/binutils/docs/gprof/index.html
The gperftools suite (originally Google Performance Tools) is a newer profiling tool similar in functionality to gprof. The suite of tools also comes with TCMalloc, a fast malloc for applications that use threads. It also throws in a memory leak detector and a heap profiler. The gperftools CPU profiler has a website that has a short introduction to the tool:
Gperftools (Google) at https://gperftools.github.io/gperftools/cpuprofile.html
The timemory tool from the National Energy Research Scientific Computing Center (NERSC) is a simple tool built on top of many other performance measurement interfaces. The simplest tool in this suite, timem, is a replacement for the Linux time command that can also output additional information, such as the memory used and the number of bytes read and written. Notably, it has an option to automatically generate a roofline plot. There is extensive usage information at the tool's documentation website:
timemory documentation at https://timemory.readthedocs.io
Open|SpeedShop has a command-line option and a Python interface that make it a possible substitute for these simple tools. It is a more powerful tool, which we discuss in section 17.3.4.
High-level tools are the best choice for a quick overview of the performance of your application. These tools distinguish themselves by focusing on identifying the high-cost parts of your code and giving a robust graphics-based overview of application performance. Unlike the simple profilers, you must often step out of your workflow and start a graphics application in order to use these high-level profilers.
We first talked about Cachegrind in section 3.3.1. Cachegrind specializes in showing you the high-cost paths through your code, enabling you to focus on the performance critical parts. It has a simple graphical user interface that is easy to understand.
Cachegrind, a cache and branch-prediction profiler (Valgrind™ Developers) at https://valgrind.org/docs/manual/cg-manual.html
Another good high-level profiler is the Arm MAP profiler, previously named Allinea MAP or Forge MAP. MAP is a commercial tool, and its parent firm has changed a few times. It uses a graphical user interface that gives more detail than KCachegrind but still focuses on the most salient details. The MAP tool has a companion tool, the DDT debugger, that comes in the Arm Forge suite of high-performance computing tools. We discuss the DDT debugger later in the chapter, in section 17.7.2. There is extensive documentation, including tutorials, webinars, and a user guide, at the Arm website:
Arm MAP (Arm Forge) at http://mng.bz/n2x2
Medium-level profilers are often used when trying to fine-tune optimizations. Many of the graphical user interface tools designed to guide your application development fall into this category. These include Intel® Advisor, VTune, CrayPat, AMD μProf, NVIDIA Visual Profiler, and CodeXL (formerly a Radeon tool and now part of the GPUOpen initiative). We start with the more general and popular tools for CPUs and then work into the specialized tools for GPUs.
Intel® Advisor is targeted at guiding the use of vectorization with Intel compilers. It shows which loops are vectorized and suggests changes to vectorize others. While it is especially useful for vectorizing code, it is also good for general profiling. Advisor is a proprietary tool, but recently has been made freely available for many users. You can install Intel Advisor using the Ubuntu package manager. You need to add the Intel package and then use apt-get to install the version with OneAPI.
wget -q https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
rm -f GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
echo "deb https://apt.repos.intel.com/oneapi all main" >> /etc/apt/sources.list.d/oneAPI.list
echo "deb [trusted=yes arch=amd64] https://repositories.intel.com/graphics/ubuntu bionic main" >> /etc/apt/sources.list.d/intel-graphics.list
apt-get update
apt-get install intel-oneapi-advisor
Complete instructions on installing the Intel OneAPI software from its package repository can be found at http://mng.bz/veO4.
Intel® VTune is a general-purpose optimization tool that helps to identify bottlenecks and potential improvements. It is also another proprietary tool that’s freely available. VTune can be installed with apt-get from the OneAPI suite.
wget -q https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
rm -f GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
echo "deb https://apt.repos.intel.com/oneapi all main" >> /etc/apt/sources.list.d/oneAPI.list
echo "deb [trusted=yes arch=amd64] https://repositories.intel.com/graphics/ubuntu bionic main" >> /etc/apt/sources.list.d/intel-graphics.list
apt-get update
apt-get install intel-oneapi-vtune
The CrayPat tool is proprietary and only available on the Cray Operating System. It is an excellent command-line tool that gives simple feedback on optimization of loops and threading. If you are working at one of the many high-performance computing sites that use the Cray Operating System, this tool may be worth investigating. Unfortunately, it is not available elsewhere.
AMD μProf is the profiling tool from AMD for their CPUs and APUs. Accelerated Processing Unit (APU) is the AMD term for a CPU with an integrated GPU that was first introduced when AMD bought out ATI, manufacturer of the Radeon GPU. The integrated unit is more tightly coupled than a typical integrated GPU and is part of the Heterogeneous System Architecture concept from AMD. You can install the AMD μProf tool with package installers on Ubuntu or Red Hat Enterprise Linux. The download requires a manual acceptance of the EULA. To install AMD μProf, follow these steps:
Scroll down to the bottom of the page and select the appropriate file
Accept the EULA to start the download with the package manager
Ubuntu: dpkg --install amduprof_x.y-z_amd64.deb
RHEL: yum install amduprof-x.y-z.x86_64.rpm
More details on installation are given in the user guide, which is available at the AMD developer website: https://developer.amd.com/wordpress/media/2013/12/User_Guide.pdf.
NVIDIA Visual Profiler is part of the CUDA software suite. It is being incorporated into the NVIDIA® Nsight suite of tools. We covered this tool in section 13.4.3. The NVIDIA tools can be installed on the Ubuntu Linux distribution with the following commands:
wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.2.89-1_amd64.deb
dpkg -i cuda-repo-ubuntu1804_10.2.89-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt-get update
apt-get install cuda-nvprof-10-2 cuda-nsight-systems-10-2 cuda-nsight-compute-10-2
CodeXL is the GPUOpen code development workbench with profiling support for Radeon GPUs. It is part of the GPUOpen open source initiative begun by AMD. The CodeXL tool combines both debugger and profiler functionality. The CPU profiling capability was moved to the AMD μProf tool so that CodeXL could be released as open source. Follow these instructions to install CodeXL on Ubuntu or RedHat Linux distributions.
wget https://github.com/GPUOpen-Archive/CodeXL/releases/download/v2.6/codexl-2.6-302.x86_64.rpm
RHEL or CentOS: rpm -Uvh --nodeps codexl-2.6-302.x86_64.rpm
Ubuntu: apt-get install rpm
        rpm -Uvh --nodeps codexl-2.6-302.x86_64.rpm
There are several tools that produce detailed application profiles. If you need to extract every bit of performance from your application, you should learn to use at least one of these tools. The challenge with these tools is that they produce so much information that it can be time-consuming to understand and use the results. You will also need some hardware architecture expertise to really make sense of the profiling data. The tools in this category should be used after you have gotten what you can out of the simpler profiling tools. The detailed profilers that we cover in this section are HPCToolkit, Open|SpeedShop, and TAU.
HPCToolkit is a powerful, detailed profiler developed as an open source project by Rice University. It uses hardware performance counters to measure performance and presents the data through graphical user interfaces. Its development for extreme scale on the latest high-performance computing systems is sponsored by the Department of Energy (DOE) Exascale Computing Project. The hpcviewer GUI shows performance data from a code perspective, while hpctraceviewer presents a time trace of the code execution. More information and detailed user guides are available at the HPCToolkit website. HPCToolkit can be installed with the Spack package manager using spack install hpctoolkit.
HPCToolkit at http://hpctoolkit.org
Open|SpeedShop is another profiler that can produce detailed program profiles. It has both a graphical user interface and a command-line interface. The Open|SpeedShop tool runs on all the latest high-performance computing systems as a result of DOE funding. It has support for MPI, OpenMP, and CUDA. Open|SpeedShop is open source and can be freely downloaded. Its website has detailed user guides and tutorials. Open|SpeedShop can be installed with the Spack package manager using spack install openspeedshop.
Open|SpeedShop at https://openspeedshop.org
TAU is a profiling tool developed primarily at the University of Oregon. This freely available tool has a graphical user interface that is easy to use. TAU is used on many of the largest high-performance computing applications and systems. There is extensive documentation on using TAU at the tool’s website. TAU can be installed with the Spack package manager with spack install tau.
Performance Research Lab (University of Oregon) at http://www.cs.uoregon.edu/research/tau/home.php
We noted the value of benchmarks and mini-apps for assessing the performance of your applications in chapter 3. Benchmarks are more appropriate for measuring the performance of a system. Mini-apps are more focused on application areas and how best to implement the algorithms for various architectures, but the difference between these can be blurred at times.
The following is a list of benchmarks that can be useful measures for your potential system performance. We have extensively used the STREAM Benchmark in our performance studies, but there may be more appropriate benchmarks for your application. For example, if your application loads a single data value from scattered memory locations, the Random benchmark would be the most appropriate.
Linpack at http://www.netlib.org/benchmark/hpl/—Used for the Top 500 High Performance Computers list.
STREAM at https://www.cs.virginia.edu/stream/ref.html—A benchmark for memory bandwidth. You can find a version in the Git repository at https://github.com/jeffhammond/STREAM.git.
Random at http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/—A benchmark for random memory access performance.
NAS Parallel Benchmarks at http://www.nas.nasa.gov/publications/npb.html—NASA benchmarks, first released in 1991, include some of the most heavily used benchmarks for research.
HPCG at http://www.hpcg-benchmark.org/software/—New conjugate gradient benchmark developed as an alternative to Linpack. HPCG gives a more realistic performance benchmark for current algorithms and computers.
HPC Challenge Benchmark at http://icl.cs.utk.edu/hpcc/—A composite benchmark.
Parallel Research Kernels at https://github.com/ParRes/Kernels—Various small kernels from typical scientific simulation codes and in several parallel implementations.
Applications have had to make many adaptations for new architectures. With the use of mini-apps, you can highlight the performance of a simple application type on a target system. This section presents a list of mini-apps developed by the Department of Energy (DOE) laboratories, which might provide a valuable reference implementation for your application.
The DOE laboratories have been tasked with the development of exascale computers, which provide the leading edge of high-performance computing. These laboratories have created mini-apps and proxy applications for hardware designers and application developers to experiment with how to get the most out of these exascale systems. Each of these mini-apps has a different purpose. Some reflect the performance of a large application, while others are meant for algorithmic exploration. To begin, let’s define a couple of terms to help us categorize the mini-apps.
Proxy mini-app—An extract or smaller form of a larger application that captures its performance characteristics. Proxies are useful to hardware vendors as smaller applications they can exercise during the co-design process for new hardware.
Research mini-app—A simpler form of a computational approach that is useful for researchers to explore alternative algorithms and methods for improved performance and new architectures.
The categorization of mini-apps is not perfect. Each author of a mini-app has their own reason for their creation, which often doesn’t fit into neat categories.
Exascale Project proxy apps: A cross-section of sample applications
The DOE has developed some sample applications for use in benchmarking systems, performance experiments, and algorithm development. Many of these have been organized by the DOE Exascale Computing Project at https://proxyapps.exascaleproject.org/.
ExaMiniMD—Proxy application for particle and molecular dynamics codes
NEKbone—Incompressible Navier-Stokes solver using spectral elements
Thornado-mini—Finite element, moment-based radiation transport
The Exascale Project proxy applications are selected from the many proxy applications developed by national laboratories. In the following sections, we list other proxy and mini-applications developed by various national laboratories for scientific applications that are important to the mission of their research laboratory. These applications are made available to the public and hardware developers as part of the national co-design strategy. The co-design process is where hardware developers and application developers work closely together in a feedback loop that iterates on the features of these exascale systems.
Often, the applications that these mini-apps mirror tend to be proprietary and, therefore, cannot be shared outside the corresponding laboratory. With the release of some of these mini-apps, we recognize that current applications are more complex and stress the hardware in different ways than the simple kernels previously available.
Lawrence Livermore National Laboratory proxies
Lawrence Livermore National Laboratory has been one of the leading proponents of proxy development. Their LULESH proxy is one of the most heavily studied by vendors and academic researchers. Some of the Lawrence Livermore National Laboratory proxies include
For more detail on the Lawrence Livermore National Laboratory proxies, see their website at https://computing.llnl.gov/projects/co-design/proxy-apps.
Los Alamos National Laboratory proxy applications
Los Alamos National Laboratory also has many interesting proxy applications. Some of the more popular are listed here.
For more detail on the Los Alamos National Laboratory proxies, see their website at https://www.lanl.gov/projects/codesign/proxy-apps/lanl/index.php.
Sandia National Laboratories Mantevo suite of mini-apps
Sandia National Laboratories has put together a branded mini-app suite called Mantevo, which includes their mini-apps and a few from other organizations such as the United Kingdom's Atomic Weapons Establishment (AWE). Here is a list of their mini-apps:
CloverLeaf—Cartesian grid compressible fluids hydrocode mini-app
miniFE—Proxy application for unstructured implicit finite element codes
TeaLeaf—Mini-app that solves the linear heat conduction equation with implicit solvers
More information on the Mantevo mini-app suite is available at https://mantevo.github.io.
For robust applications, you need a tool to detect and report memory errors. In this section, we discuss the capabilities and the pros and cons of a number of tools that detect and report memory errors. The memory errors that occur in applications can be broken down into these categories:
Out-of-bounds errors—Attempting to access memory beyond the array bounds. Fence-post checkers and some compilers can catch these errors.
Memory leaks—Allocating memory and never freeing it. Malloc replacement tools are good at catching and reporting memory leaks.
Uninitialized memory (未初始化的内存) - 在设置内存之前使用的内存。因为 memory 在使用之前没有设置,所以它具有 memory 中以前使用的任何值。结果是应用程序的行为可能因运行而异。这种类型的错误很难发现,专门为捕获这些错误而设计的工具是必不可少的。
Uninitialized memory—Memory that is used before it is set. Because memory is not set before its use, it has whatever value is in memory from previous use. The result is that the behavior of the application can vary from run to run. This type of error is difficult to find, and tools specifically designed to catch these are essential.
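The uninitialized-memory category can often be avoided at the source. Here is a minimal sketch (the helper name is our own, for illustration) that zero-initializes an allocation with calloc so the program's behavior no longer depends on whatever values a previous use left in memory:

```c
#include <stdlib.h>

// Hypothetical helper for illustration: calloc zeroes the block at
// allocation time, removing the run-to-run variability that leftover
// memory contents would otherwise cause.
double *make_zeroed_array(int n)
{
   double *x = (double *)calloc((size_t)n, sizeof(double));
   return x;   // every x[i] starts at 0.0
}
```

A tool such as Valgrind then only needs to flag the cases where zero is not a meaningful initial value.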
Only a few tools handle all of these categories of memory errors. Most of the tools handle the first two categories to some degree. Uninitialized memory checks are an important check and supported by just a few tools. We’ll cover those tools first.
Valgrind checks uninitialized memory with its default Memcheck tool. We first presented Valgrind for this purpose in section 2.1.3. Valgrind is a good choice both because it is open source and freely available and because it is one of the best tools at detecting memory errors in all three categories.
It's best to use Valgrind with the GCC compiler. The GCC team uses it for their development and, as a result, has cleaned up their generated code so that a suppression file for false positives is not needed for their serial applications. For parallel applications, you can also suppress the false positives detected by Valgrind with OpenMPI by using a suppression file provided by the OpenMPI package. For example:
mpirun -n 4 valgrind \
    --suppressions=$MPI_DIR/share/openmpi/openmpi-valgrind.supp <my_app>
There are only a few command-line options, and the Valgrind tool often suggests which options to use in its report. For more information on the usage, see the Valgrind website (https://valgrind.org).
Yes, really, that’s its name. Dr. Memory is a similar tool to Valgrind but newer and faster. Like Valgrind, Dr. Memory detects memory errors and problems within your program. It is an open source project, freely available across a variety of chip architectures and operating systems.
There are many other tools besides Dr. Memory in this suite of run-time tools. Because Dr. Memory is a relatively simple tool, we’ll present a quick example of how to use it. Let’s first set up Dr. Memory for use.
We'll try out Dr. Memory on the example in the repository at https://github.com/EssentialsofParallelComputing/Chapter17. The following listing is a copy of the code from listing 4.1 of chapter 4. The code is just a fragment to check that the syntax correctly compiles.
Listing 17.3 DrMemory test example
DrMemory/memoryexample.c
1 #include <stdlib.h>
2
3 int main(int argc, char *argv[])
4 {
5 int j, imax, jmax;
6
7 // first allocate a column of pointers of type pointer to double
8 double **x = (double **) ❶
malloc(jmax * sizeof(double *)); ❶
9
10 // now allocate each row of data
11 for (j=0; j<jmax; j++){ ❷
12 x[j] = (double *)malloc(imax * sizeof(double));
13 }
14 }
❶ Uninitialized memory read of variable jmax
❷ Uses the uninitialized variables jmax and imax
Running this example takes just a few commands. Retrieve the code from the supplemental examples for the chapter and build it:
git clone --recursive \
    https://github.com/EssentialsofParallelComputing/Chapter17
cd DrMemory
make
Now run the example by executing drmemory, followed by two dashes and then the name of the executable: drmemory -- memoryexample. Figure 17.2 shows the report that Dr. Memory produces.
Figure 17.2 Report from Dr. Memory shows an uninitialized read at line 11 and a memory leak for memory allocated at line 8.
Dr. Memory correctly flags that jmax was not initialized when used on line 11. It also shows a leak for the memory allocated on line 12. To fix these, we initialize jmax, free each x[j] pointer and then the x array, and try again with drmemory -- memoryexample. Figure 17.3 shows the report.
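The fix can be sketched as a pair of helper functions: initialize the sizes before they are used, and free each row before the column of pointers. (The names alloc_2d and free_2d are our own, chosen for illustration; they are not part of the book's listings.)

```c
#include <stdlib.h>

// Allocate a 2D array the same way as listing 17.3, but with the sizes
// passed in as initialized arguments rather than read uninitialized.
double **alloc_2d(int jmax, int imax)
{
   // first allocate a column of pointers of type pointer to double
   double **x = (double **)malloc(jmax * sizeof(double *));

   // now allocate each row of data
   for (int j = 0; j < jmax; j++) {
      x[j] = (double *)malloc(imax * sizeof(double));
   }
   return x;
}

// Free each row first, then the column of pointers, so nothing leaks.
void free_2d(double **x, int jmax)
{
   for (int j = 0; j < jmax; j++) {
      free(x[j]);
   }
   free(x);
}
```

With every allocation paired to a free and every size set before use, a rerun of Dr. Memory should report neither the uninitialized read nor the leak.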
Figure 17.3 This Dr. Memory report shows that the uninitialized memory error and the leak are fixed.
The report from Dr. Memory in figure 17.3 shows no errors after our fix. Note that Dr. Memory does not flag that imax is uninitialized. For more information on Dr. Memory for Windows, Linux, and Mac, see https://drmemory.org.
Purify and Insure++ are commercial tools that detect memory errors, including some form of uninitialized memory check. TotalView includes a memory checker in its most recent versions. If you have a demanding application that requires extreme quality code, and you are looking for vendor support for your memory checking tool, one of these commercial tools may be a good choice.
Many compilers are incorporating memory tools into their products. The LLVM compiler has a set of tools that includes memory checker functionality. This includes MemorySanitizer, AddressSanitizer, and ThreadSanitizer. The GNU toolchain provides the mtrace facility in the GNU C library, which detects memory leaks.
Several tools place blocks of memory before and after memory allocations to detect out-of-bounds memory accesses and to also track memory leaks. These types of memory checkers are referred to as fence-post memory checkers. These are fairly simple tools to implement and are usually provided as a library. Additionally, these tools are portable and easy to add to a regular regression testing system.
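To make the fence-post idea concrete, here is a minimal sketch of how such a checker can work. The function names and magic number here are our own inventions for illustration, not dmalloc's API: a guard word is written on each side of the user's block and verified later.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FENCE 0xDEADBEEFCAFEF00DULL   /* arbitrary "picket fence" magic number */

// Allocate size bytes with a guard word before and after the user block.
void *fence_malloc(size_t size)
{
   unsigned char *raw = malloc(size + 2 * sizeof(uint64_t));
   if (raw == NULL) return NULL;
   uint64_t magic = FENCE;
   memcpy(raw, &magic, sizeof(uint64_t));                           /* front fence */
   memcpy(raw + sizeof(uint64_t) + size, &magic, sizeof(uint64_t)); /* back fence */
   return raw + sizeof(uint64_t);
}

// Returns 1 if both fences are intact, 0 if an out-of-bounds write hit them.
int fence_check(const void *ptr, size_t size)
{
   const unsigned char *raw = (const unsigned char *)ptr - sizeof(uint64_t);
   uint64_t front, back;
   memcpy(&front, raw, sizeof(uint64_t));
   memcpy(&back, raw + sizeof(uint64_t) + size, sizeof(uint64_t));
   return front == FENCE && back == FENCE;
}

void fence_free(void *ptr)
{
   free((unsigned char *)ptr - sizeof(uint64_t));
}
```

A real tool such as dmalloc also records the file and line of each allocation so a corrupted fence can be traced back to its source.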
Here we discuss dmalloc in detail and how to use a fence-post memory checker. Electric Fence and Memwatch are two other packages that provide fence-post memory checks and have an analogous use model, but dmalloc is the best known fence-post memory checker. It replaces the malloc library with a version that provides memory checking.
For our source code in the following listing, we added the dmalloc header file with an include directive on line 3 so that we get line numbers in our report.
Listing 17.4 Dmalloc example code
Dmalloc/mallocexample.c
 1 #include <stdlib.h>
 2 #ifdef DMALLOC
 3 #include "dmalloc.h"      ❶
 4 #endif
 5
 6 int main(int argc, char *argv[])
 7 {
 8    int imax=10, jmax=12;
 9
10    // first allocate a block of memory for the row pointers
11    double *x = (double *)malloc(imax*sizeof(double *));
12
13    // now initialize the x array to zero
14    for (int i = 0; i < jmax; i++) {   ❷
15       x[i] = 0.0;                     ❷
16    }
17    free(x);
18    return(0);
19 }
❶ Includes the dmalloc header file
❷ Writes past the end of the x array
We’ve included an out-of-bounds access on the x array on lines 14 and 15. Now we can build our executable and run it:
make
./mallocexample
But the output to the terminal reports a failure:
debug-malloc library: dumping program, fatal error
Error: failed OVER picket-fence magic-number check (err 27)
Abort trap: 6
Let’s get more information about the problem from the log file shown in figure 17.4.
Figure 17.4 The dmalloc log file shows an out-of-bounds memory access on the array allocated at line 11.
Dmalloc has detected the out-of-bounds access. Great! You can find more information on dmalloc at its website (https://dmalloc.com).
GPU vendors are developing memory tools for detecting memory errors for applications running on their hardware. NVIDIA has released a corresponding tool, and other GPU vendors are sure to follow. The NVIDIA CUDA-MEMCHECK tool checks for out-of-bounds memory references, data race detections, synchronization usage errors, and uninitialized memory. The tool can be run as a standalone command:
cuda-memcheck [--tool memcheck|racecheck|initcheck|synccheck] <app_name>
Documentation on the tool usage is available on the NVIDIA website:
CUDA-MEMCHECK, CUDA Toolkit Documentation at https://docs.nvidia.com/cuda/cuda-memcheck/index.html
Tools to detect thread race conditions (also called data hazards) are critical in developing OpenMP applications. It is impossible to develop robust OpenMP applications without a race detection tool. Yet, there are few tools that can detect race conditions. Two tools that are effective are Intel Inspector and Archer, which we discuss next.
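As a reminder of what these tools look for, here is a minimal sketch of the classic case. This fragment is our own illustration, not one of the book's examples: without the reduction clause, every thread would update sum concurrently, which is exactly the kind of data race these tools report.

```c
// Summing an array across OpenMP threads. Removing reduction(+:sum) turns
// the "sum +=" update into an unsynchronized write shared by all threads,
// a classic data race. With the clause, each thread keeps a private
// partial sum that is combined safely at the end of the loop.
double sum_array(const double *x, int n)
{
   double sum = 0.0;
   #pragma omp parallel for reduction(+:sum)
   for (int i = 0; i < n; i++) {
      sum += x[i];
   }
   return sum;
}
```

Compile with -fopenmp (or your compiler's equivalent); without the flag, the pragma is ignored and the loop runs serially with the same result.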
Intel® Inspector is a tool with a graphical user interface that is effective at detecting race conditions in OpenMP code. We discussed Intel Inspector earlier in section 7.9. Though Inspector is an Intel proprietary tool, it is now freely available. On Ubuntu, it can be installed from Intel's oneAPI suite:
wget -q https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
rm -f GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
echo "deb https://apt.repos.intel.com/oneapi all main" >> \
    /etc/apt/sources.list.d/oneAPI.list
echo "deb [trusted=yes arch=amd64] https://repositories.intel.com/graphics/ubuntu bionic main" >> \
    /etc/apt/sources.list.d/intel-graphics.list
apt-get install intel-oneapi-inspector
Archer is an open source tool built on LLVM's ThreadSanitizer (TSan) and adapted for detecting thread race conditions in OpenMP. Using the Archer tool is basically just a matter of replacing the compiler command with clang-archer and linking in the Archer library with -larcher. Archer outputs its report as text.
You can manually install Archer with the LLVM compiler, or install with the Spack package manager using spack install archer. We have included some build scripts with the accompanying examples at https://github.com/EssentialsofParallelComputing/Chapter17 for installation. Once the Archer tool is installed, you can build our example in the Archer subdirectory of the examples. In the example, we use one of the stencil codes from section 7.3.3. We then modify the CMake build system by changing the compiler command to clang-archer and by adding the Archer libraries to the link command as the following listing shows.
Listing 17.5 Archer example code
Archer/CMakeLists.txt
 1 cmake_minimum_required (VERSION 3.0)
 2 project (stencil)
 3
 4 set (CC clang-archer)      ❶
 5
 6 set (CMAKE_C_STANDARD 99)
 7
 8 set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -g -O3")
 9
10 find_package(OpenMP)
11
12 # Adds build target of stencil with source code files
13 add_executable(stencil stencil.c timer.c timer.h malloc2D.c malloc2D.h)
14 set_target_properties(stencil PROPERTIES COMPILE_FLAGS ${OpenMP_C_FLAGS})
15 set_target_properties(stencil PROPERTIES LINK_FLAGS
      "${OpenMP_C_FLAGS} -L${HOME}/archer/lib -larcher")      ❷
❶ Sets the compiler command to clang-archer
❷ Adds the archer libraries to LINK_FLAGS
Compile the code and run it as before:
mkdir build && cd build
cmake ..
make
./stencil
We get the Archer tool output mixed in with the normal output as figure 17.5 shows.
Figure 17.5 Output from the Archer data race detection tool
Some race conditions reported at startup appear to be false positives, but there are no additional messages during the run. For more information, check out the following documentation:
“Archer PRUNERS: Providing Reproducibility for Uncovering Non-deterministic Errors in Runs on Supercomputers” (2017), https://pruners.github.io/archer/
Archer repository at https://github.com/PRUNERS/archer
You spend much of your application development time fixing bugs. This is especially true in parallel application development. Any tool that helps with this process is vitally important. Parallel programmers also need additional capabilities targeted at dealing with multiple processes and threads.
The debuggers used for large parallel applications at high performance computing sites generally include a couple of commercial offerings. This includes the powerful and easy-to-use TotalView and Arm DDT debuggers. But most code development is initially done on laptops, desktops, and local clusters outside of large centers, so you may not have access to a commercial debugger on these smaller systems. The non-commercial debuggers available for smaller clusters, desktops, and laptops are more limited in parallel programming features and harder to use. In this section, we begin with a discussion of the commercial debuggers.
TotalView has extensive support for leading high performance computing systems, including MPI and OpenMP threading. TotalView has some support for debugging NVIDIA GPUs using CUDA. It uses a graphical user interface and is easy to navigate; it also has a great depth of features that take some exploration. TotalView is generally invoked by prefixing the command line with totalview. The -a flag indicates that the rest of the arguments are to be passed to the application:
totalview mpirun -a -n 4 <my_application>
Lawrence Livermore National Laboratory has a good tutorial on TotalView. Detailed information is available at the TotalView websites:
TotalView (Lawrence Livermore National Laboratory) at https://computing.llnl.gov/tutorials/totalview/
TotalView (Perforce) at https://totalview.io
The ARM DDT debugger is another popular commercial debugger used at high performance computing sites. It has extensive support for MPI and OpenMP. It also has some support for debugging CUDA code. The DDT debugger uses a graphical user interface that is very intuitive. In addition, DDT has support for remote debugging. In this case, the graphical client interface is run on your local system, and the application that is being debugged is remotely launched on the high performance computing system. To start a debug session with DDT, just prepend ddt to your command line:
ddt <my_application>
The Texas Advanced Computing Center has a good introduction to DDT. There is also more information at the DDT websites:
ARM DDT Debugger tutorials (TACC, Texas Advanced Computing Center) at https://portal.tacc.utexas.edu/tutorials/ddt
ARM DDT (ARM Forge) at https://www.arm.com/products/development-tools/server-and-hpc/forge/ddt
The standard Linux debugger, GDB, is ubiquitous on Linux platforms. Its command-line interface requires some work to learn. For a serial executable, GDB runs with the command
gdb <my_application>
GDB does not have built-in parallel MPI support. You may be able to debug parallel jobs by launching a separate GDB session in an xterm for each MPI rank with the mpirun command. Because xterms cannot be launched in all environments, this is not a foolproof technique:
mpirun -np 4 xterm -e gdb ./<my_application>
Many higher-level user interfaces are built on top of GDB. The simplest of these is cgdb, a curses-based interface with a strong similarity to the vi editor. The curses interface is a character-based windowing system. It has the advantage of better network performance characteristics than a full-fledged, bit-mapped graphical user interface. cgdb is widely available along with its documentation here:
cgdb, the curses debugger, at https://cgdb.github.io
A full graphical user interface to GDB is available in the DataDisplayDebugger, known as DDD. The DDD debugger website gives more information on DDD and other similar debuggers:
DDD, the DataDisplayDebugger, at https://www.gnu.org/software/ddd/
Neither cgdb nor DDD includes explicit parallel support. Other higher-level user interfaces such as the Eclipse IDE provide a parallel debugger interface on top of the GDB debugger. The Eclipse IDE is available for a wide range of languages and provides the foundation for programming tools for CPUs and GPUs.
Desktop IDEs (Eclipse Foundation) at https://www.eclipse.org/ide/
The availability of debuggers for the development of GPU code is a critical game changer. The development of GPU code has been seriously hampered by the difficulty of debugging on GPUs. The GPU debugging tools discussed in this section are still immature, but any capability is sorely needed. These GPU debuggers heavily leverage the open source tools such as GDB and DDD introduced in the previous section.
CUDA-GDB: A debugger for the NVIDIA GPUs
CUDA has a command-line debugger based on GDB called CUDA-GDB. There is also a version of CUDA-GDB with a graphical user interface in NVIDIA's Nsight™ Eclipse tool as part of their CUDA toolkit. CUDA-GDB has also been integrated into DDD and Emacs. To use CUDA-GDB with DDD, launch DDD using ddd --debugger cuda-gdb. You'll find the CUDA-GDB documentation at https://docs.nvidia.com/cuda/cuda-gdb/.
ROCgdb: A debugger for the Radeon GPUs
The AMD ROCm debugger, part of the Radeon Open Compute initiative, is based on the GDB debugger but with initial support for the AMD GPUs. The ROCm website has documentation on ROCgdb, but it is largely the same as the GDB debugger.
The site for the AMD ROCm debugger is at https://rocmdocs.amd.com/en/latest/ROCm_Tools/ROCgdb.html.
The ROCm website is https://rocmdocs.amd.com.
Check for updates for the ROCm debugger in the ROCgdb User Guide at https://github.com/RadeonOpenCompute/ROCm/blob/master/Debugging%20with%20ROCGDB%20User%20Guide%20v4.1.pdf.
Filesystem performance is often an afterthought in high performance computing application development. In today's world of big data, and with filesystem performance lagging other parts of the computing system, filesystem performance is a growing issue. The necessary tools for measuring filesystem performance are scarce. The Darshan tool was developed to fill this gap. Darshan, an HPC I/O characterization tool, specializes in profiling an application's use of the filesystem. Since its release, Darshan has achieved widespread use at high performance computing centers.
We made our changes to the CMakeLists.txt file in the MPI_IO_Examples/mpi_io_block2d directory at https://github.com/EssentialsofParallelComputing/Chapter17.git. This is the same MPI-IO example we presented in section 16.3, but with a larger 1000x1000 mesh and with the verification code commented out. Now you can build and run the executable as before:
mkdir build && cd build
cmake ..
make
mpirun -n 4 mpi_io_block2d
You should find the Darshan logs organized by date in your ~/darshan-logs subdirectories.
The Darshan analysis tool outputs a few pages of text and graphics information on the file operations in your application in portable document format (PDF). We show a part of the output in figure 17.6.
Figure 17.6 The graphs are part of the output from the Darshan I/O characterization tool. Both the standard IO (POSIX) and MPI-IO are shown. From the graph on the upper right, we can confirm that MPI-IO used collective rather than independent operations.
We built the run-time tool with support for both POSIX and MPI-IO profiling. POSIX, an acronym for Portable Operating System Interface, is the standard for portability for a wide range of system-level functions such as regular filesystem operations. For our modified test, we turned off all of the verification and other standard IO operations so that we can focus on the MPI-IO parts of the code. We also made the arrays larger. This test was done on the NFS filesystem that is used for our home directory. In the figure, we can see that we did both an MPI-IO write and read and that the write is slightly slower than the read. We can also see that the cost of the MPI metadata operations is much higher. The writing of file metadata records information about where the file is located, its permissions, and its access times. By its nature, writing metadata is a serial operation.
Darshan also has some support for profiling HDF5 file operations. You can get more information on the Darshan HPC I/O characterization tool at the project website:
https://www.mcs.anl.gov/research/projects/darshan/
Package managers have become critical tools for simplifying software package installation on a variety of systems. These tools first appeared on Linux systems with the Red Hat package manager to manage software installation, but these have since become widespread in many operating systems. Using package managers to install tools and device drivers can greatly simplify the installation process and keep your system more stable and up-to-date.
Linux operating systems heavily rely on the use of package management. You should use your Linux package system to install software whenever possible. Unfortunately, not all software packages, and particularly vendor device drivers, are set up for installing with package managers. Without the use of a package manager, software installation is more difficult and error-prone. Most high performance computing software packages for Linux are distributed as Debian (.deb) or as Red Hat Package Manager (.rpm) package formats. These package formats can be installed on most Linux distributions.
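As a sketch, installing a downloaded package of either format by hand looks like the following; the package file names are hypothetical:

```shell
# Hypothetical package files; dpkg and rpm are the low-level installers
sudo dpkg -i openmpi_4.0.3-1_amd64.deb      # Debian, Ubuntu (.deb)
sudo rpm -i openmpi-4.0.3-1.x86_64.rpm      # Red Hat, Fedora (.rpm)
```

Where possible, prefer the higher-level tools (apt, dnf) that also resolve dependencies.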
For the Mac operating system (macOS), the two major package managers are Homebrew and MacPorts. In general, both are good choices for installing software packages. Because macOS is a derivative of the Berkeley Software Distribution (BSD) Unix, many open source tools are available. But with recent changes to macOS to improve security, some tools have dropped support for the latest releases for the platform. And with recent changes to the Mac hardware, there may be significant changes to package management. More information on Homebrew and MacPorts is available at their respective websites:
Homebrew at https://brew.sh
MacPorts at https://www.macports.org
The heavily proprietary Windows operating system has long been a mixed bag for software installation and support. Some software has been well supported and other software not at all. Things are changing at Microsoft as it embraces the open source movement. Windows is just now coming to the party with its new Windows Subsystem for Linux (WSL). WSL sets up a Linux environment within a shell and should permit most Linux software to work without changes. A recent announcement that WSL would support transparent access to the GPU has generated excitement in the high performance community. Of course, the main targets are gaming and other mass-market applications, but we’ll be happy to ride the coattails if possible.
So far, we have discussed package managers focused around specific computing platforms. The challenges of a tool for high performance computing are much greater than those for traditional package managers because of the larger number of operating systems, hardware, and compilers that need to be simultaneously supported. It took until 2013, when Todd Gamblin at Lawrence Livermore National Laboratory released the Spack package manager, to address these issues. One of this book’s authors contributed a couple of packages to the Spack list when there were fewer than a dozen packages in the whole system. Now there are over 4,000 supported packages and many of these are unique to the high performance computing community.
You’ll find many Spack commands. Table 17.3 provides a few to get you started.
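A few everyday Spack commands, using hdf5 as an example package, give a flavor of the tool; this is a sketch of common usage rather than a complete reference:

```shell
spack list hdf5         # search the package catalog
spack info hdf5         # show available versions and variants
spack install hdf5+mpi  # build and install with the MPI variant enabled
spack find              # list what is installed
spack load hdf5         # add the installed package to your environment
```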
Spack has extensive documentation and an active development community. Check their site for up-to-date information: https://spack.readthedocs.io.
The realities of software development on large computing sites are that these sites have to simultaneously support multiple environments. Because of this, you can load different versions of the GCC and MPI for testing. You might be able to load these different development toolchains, but the software modules do not come with the extensive testing that is done with most vendor distributions.
Warning: Errors can occur with toolchain software loaded through the Modules package. The advantages for high performance applications, however, are largely worth the potential difficulties.
Now let’s look at the typical commands you might use with a toolchain system installed with the Modules package as table 17.4 shows.
Table 17.4 Toolchain module commands: Quick start
module purge    Unloads all modules and restores the environment to the state before any modules were loaded
Because the module show command displays the actions executed by the module, let’s look at a couple of examples for the GCC compiler suite and for CUDA.
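A sketch of what module show reports is below; the module versions and install paths are hypothetical, but the pattern of environment-variable settings is typical:

```shell
module show gcc/9.3.0
# -------------------------------------------------------
# /usr/share/modulefiles/gcc/9.3.0:
# prepend-path  PATH             /opt/gcc/9.3.0/bin
# prepend-path  LD_LIBRARY_PATH  /opt/gcc/9.3.0/lib64
# -------------------------------------------------------

module show cuda/11.2
# -------------------------------------------------------
# /usr/share/modulefiles/cuda/11.2:
# prepend-path  PATH             /opt/cuda/11.2/bin
# prepend-path  LD_LIBRARY_PATH  /opt/cuda/11.2/lib64
# setenv        CUDA_HOME        /opt/cuda/11.2
# -------------------------------------------------------
```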
As you can see from the examples of these Modules commands, the modules are simply setting some environment variables. This is why Modules is not foolproof. Here are some important hints for using Modules that we learned the hard way. We begin with the following:
Consistency is important. Set the same modules for compiling and running your code. If the path to the library changes, your code may crash or give you the wrong results.
Automate as much as possible. If you neglect to do so, your first build (or run) will fail before you realize you forgot to load your modules.
Also, there are different approaches to loading module files. Each is filled with advantages and disadvantages. These approaches are
Use interactive shell startup scripts, not batch startup scripts (e.g., load Modules in a .login file instead of a .cshrc). Parallel jobs propagate their environment to remote nodes. If you load Modules in the wrong shell startup script, your remote nodes can have different modules than your head node. This could have unexpected consequences.
Use module purge in batch scripts before loading Modules. If you have Modules loaded, the module load can fail because of a conflict, potentially causing your program to fail. (Note that it is unreliable to use module purge on Cray systems.)
Set run paths in program builds. Embedding run paths in your executable through the rpath link option or other build mechanisms helps to make your application less sensitive to changing Modules environments and paths. The disadvantage is that your application may not run on another system if the compilers are not in the same location. Note that this technique does not help with getting the wrong version of a program such as mpirun from your PATH variable.
Load specific versions of compilers (e.g., GCC v9.3.0 rather than just GCC). Often a particular compiler version is set as default, but this will change at some point, breaking your application or build. Also, defaults are not going to be the same on all systems.
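The hints above can be combined into a batch-script preamble. A minimal sketch, assuming a TCL modules site where these module names and versions exist:

```shell
#!/bin/sh
module purge                # start clean (note: unreliable on Cray systems)
module load gcc/9.3.0       # specific compiler version, not just gcc
module load openmpi/4.0.3   # the MPI build that matches that compiler
mpirun -n 16 ./my_app       # my_app is a placeholder executable name
```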
There are two major software packages that implement basic Modules commands. The first is called module, often called TCL modules, and the second is Lmod. We discuss these in the following sections.
Yeah, this is confusing. The Modules package created the category that now more or less uses the same name—module. In 1991, John Furlani at Sun Microsystems created module and then released it as open source software. The module tool is written in the Tool Command Language, better known as TCL. It has proven to be an essential component at major computing centers. The module documentation is at https://modules.readthedocs.io/en/stable/module.html.
Lmod is a Lua-based Modules system that dynamically sets up a user’s environment. It is a newer implementation of the environment modules concept. The lmod documentation is at https://lmod.readthedocs.io/en/latest.
We wish we had the time and space to go through in better detail how to use each of these tools. Unfortunately, it would take another book (even several books) to explore the world of tools for high performance computing.
We have gone through some of the simpler tools, presenting both their power and usefulness. Just like you shouldn’t judge a book by its cover, don’t judge a tool by the fancy interface. Instead, you should look at what the tool does and how easy it is to use. Our experience has been that fancy user interfaces, instead of functionality, often become the goal. In addition, tools should be simple. We have grown weary of facing another 600-page quick start guide to just learn the next tool. Yes, the tool might be great and do wondrous things, but an application developer has a lot of other things to master as well. The best tools can be picked up and made useful in a couple of hours.
Now we turn some of the effort over to you to try these tools, and hopefully, you will find some that will expand your developer’s toolset. The addition of just a couple of tools makes you a better and more effective programmer. Here are a few exercises to get you started.
Run the Dr. Memory tool on one of your small codes or one of the codes from the exercises in this book.
Compile one of your codes with the dmalloc library. Run your code and view the results.
Try inserting a thread race condition into the example code in section 17.6.2 and see how Archer reports the problem.
Try the profiling exercise in section 17.8 on your filesystem. If you have more than one filesystem, try it on each. Then change the size of the array in the example to 2000x2000. How does it change the filesystem performance results?
Better software development practices start with version control. Creating a solid software development environment results in faster and better code development.
Use timers and profilers to measure the performance of your applications. Measuring performance is the first step towards improving application performance.
Explore the various mini-apps to see programming examples relevant to your application area. Learning from these examples will help you avoid reinventing the methods and improve your application.
Use tools that help with detecting problems in your application. This improves your program quality and robustness.
We have already provided a list of additional resources at the end of each chapter that we suggest for learning more about topics covered in the chapter. In each chapter, we placed the materials that we think would be most valuable to most readers. The references in this appendix are for those interested in the source materials that were used in developing the book. The citations are partially to give credit to the original authors of research and technical reports. These are also important for those conducting more in-depth research on a particular topic.
Amdahl, Gene M. “Validity of the single processor approach to achieving large scale computing capabilities.” Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. (1967):483-48. https://doi.org/10.1145/1465482.1465560.
Flynn, Michael J. “Some Computer Organizations and Their Effectiveness.” In IEEE Transactions on Computers, Vol. C-21, no. 9 (September, 1972): 948-960.
Gustafson, John L. “Reevaluating Amdahl’s Law.” In Communications of the ACM, Vol. 31, no. 5 (May, 1988):532-533. http://doi.acm.org/10.1145/42411.42415.
Horowitz, M., Labonte, F., and Rupp, K., et al. “Microprocessor Trend Data.” Accessed February 20, 2021. https://github.com/karlrupp/microprocessor-trend-data.
CMake. https://cmake.org/.
Empirical Roofline Toolkit (ERT). https://bitbucket.org/berkeleylab/cs-roofline-toolkit.
Intel® Advisor. https://software.intel.com/en-us/advisor.
likwid. https://github.com/RRZE-HPC/likwid.
STREAM download. https://github.com/jeffhammond/Stream.git.
Valgrind. http://valgrind.org/.
McCalpin, J. D. “STREAM: Sustainable Memory Bandwidth in High Performance Computers.” Accessed February 20, 2021. https://www.cs.virginia.edu/stream/.
Peise, Elmar. “Performance Modeling and Prediction for Dense Linear Algebra.” arXiv:1706.01341 (June, 2017). Preprint: https://arxiv.org/abs/1706.01341.
Williams, S. W., D. Patterson, et al. “The Roofline Model: A pedagogical tool for auto-tuning kernels on multicore architectures.” In Hot Chips, A Symposium on High Performance Chips, Vol. HC20 (August 10, 2008).
Data-oriented design. https://github.com/dbartolini/data-oriented-design.
Bird, R. “Performance Study of Array of Structs of Arrays.” Los Alamos National Lab (LANL). Paper in preparation.
Garimella, Rao, and Robert W. Robey. “A Comparative Study of Multi-material Data Structures for Computational Physics Applications,” no. LA-UR-16-23889. Los Alamos National Lab (LANL) (January, 2017).
Hennessy, John L., and David A. Patterson. Computer architecture: A Quantitative Approach. 5th ed. San Francisco, CA, USA: Morgan Kaufmann, 2011.
Hofmann, Johannes, Jan Eitzinger, and Dietmar Fey. “Execution-Cache-Memory Performance Model: Introduction and Validation.” arXiv:1509.03118 (March, 2017). Preprint: https://arxiv.org/abs/1509.03118.
Hollman, David, Bryce Lelbach, H. Carter Edwards, et al. “mdspan in C++: A Case Study in the Integration of Performance Portable Features into International Language Standards.” IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC) (November, 2019):60-70.
Treibig, Jan, and Georg Hager. “Introducing a performance model for bandwidth-limited loop kernels.” International Conference on Parallel Processing and Applied Mathematics (May, 2009):615-624.
Ahrens, Peter, Hong Diep Nguyen, and James Demmel. “Efficient Reproducible Floating Point Summation and BLAS.” In EECS Department, University of California, Berkeley, Technical Report, No. UCB/EECS-2015-229 (December, 2015).
Alcantara, Dan A., Andrei Sharf, Fatemeh Abbasinejad, et al. “Real-time parallel hashing on the GPU.” In ACM Transactions on Graphics (TOG), Vol. 28, no. 5 (December, 2009):154.
Anderson, Alyssa. “Achieving Numerical Reproducibility in the Parallelized Floating Point Dot Product.” (April, 2014). https://digitalcommons.csbsju.edu/honors_theses/30/.
Blelloch, Guy E. “Scans as primitive parallel operations.” In IEEE Transactions on computers, Vol. 38, no. 11 (November, 1989):1526-1538.
Blelloch, Guy E. Vector models for data-parallel computing. Cambridge, MA, USA: The MIT Press, 1990.
Chapp, Dylan, Travis Johnston, and Michela Taufer. “On the Need for Reproducible Numerical Accuracy through Intelligent Runtime Selection of Reduction Algorithms at the Extreme Scale.” 2015 IEEE International Conference on Cluster Computing (October, 2015):166-175.
Cleveland, Mathew A., Thomas A. Brunner, et al. “Obtaining identical results with double precision global accuracy on different numbers of processors in parallel particle Monte Carlo simulations.” In Journal of Computational Physics, Vol. 251 (October, 2013):223-236.
Harris, Mark, Shubhabrata Sengupta, and John D. Owens. “Parallel Prefix Sum (Scan) with CUDA.” In GPU Gems 3, no. 39 (April, 2007):851-876.
Lessley, Brenton. “Data-Parallel Hashing Techniques for GPU Architectures.” In Eurographics Conference on Visualization (EuroVis), Vol. 37, no. 3 (July, 2018).
Hoefler, Torsten, and Jesper Larsson Traff. “Sparse collective operations for MPI.” 2009 IEEE International Symposium on Parallel & Distributed Processing (July, 2009):18.
Thakur, Rajeev, and William Gropp. “Test suite for evaluating performance of multithreaded MPI communication.” In Parallel Computing, Vol. 35, no. 12 (December, 2009):608-617.
Yang, Charlene, Thorsten Kurth, and Samuel Williams. “Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system.” In Concurrency and Computation: Practice and Experience (November, 2019). https://doi.org/10.1002/cpe.5547.
CUDA Toolkit Documentation. “Compute Capabilities.” CUDA C++ Programming Guide, v11.2.1 (NVIDIA Corporation, 2021). https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities.
Harris, Mark. “Optimizing Parallel Reduction in CUDA.” (NVIDIA Corporation). https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
BBC News. “Indonesia tsunami: How a volcano can be the trigger.” BBC Global News Ltd (December, 2018). http://mng.bz/y92d.
Broquedis, François, Jérôme Clet-Ortega, et al. “hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications.” Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP2010). IEEE Computer Society Press (February, 2010):180-186. https://ieeexplore.ieee.org/document/5452445.
Hewlett Packard Enterprise, Original process placement program, xthi.c. CLE User Application Placement Guide (CLE 5.2.UP04) S-2496, pg 87. http://mng.bz/MgWB.
“OpenMP Application Programming Interface,” v5.0. OpenMP Architecture Review Board (November, 2018). https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf.
Samuel K. Gutiérrez, “Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs.” (December, 2018). https://www.cs.unm.edu/~samuel/publications/2018/skgutierrez-dissertation.pdf.
Samuel K. Gutiérrez, Davis, Kei, et al. “Accommodating Thread-Level Heterogeneity in Coupled Parallel Applications.” Proceedings of the IEEE International Parallel and Distributed Processing Symposium (May, 2017). https://github.com/lanl/libquo/blob/master/docs/publications/quo-ipdps17.pdf.
Squyres, Jeff. “Process Placement.” (September, 2014). Accessed February 20, 2021. https://github.com/open-mpi/ompi/wiki/ProcessPlacement.
Treibig, J., G. Hager and G. Wellein. “LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments.” arXiv:1004.4431 (June, 2010). Preprint: http://arxiv.org/abs/1004.4431.
BeeGFS (The leading parallel file system). https://www.beegfs.io/c/.
Lustre®. OpenSFS and EOFS. http://lustre.org.
The OrangeFS Project. http://www.orangefs.org.
Panasas PanFS Parallel File System. https://www.panasas.com/panfs-architecture/panfs/.
Gropp, William. “Lecture 33: More on MPI I/O Best practices for parallel IO and MPI-IO hints.” Accessed February 20, 2021. http://wgropp.cs.illinois.edu/courses/cs598-s15/lectures/lecture33.pdf.
Mendez, Sandra, Sebastian Lührs, et al. “Best Practice Guide—Parallel I/O.” Accessed February 20, 2021. https://prace-ri.eu/wp-content/uploads/Best-Practice-Guide_Parallel-IO.pdf.
Thakur, Rajeev, Ewing Lusk, and William Gropp. Users guide for ROMIO: A high-performance, portable MPI-IO implementation. ANL/MCS-TM-234. Argonne, IL, USA: Argonne National Laboratory (October, 1997).
Thakur, Rajeev, William Gropp, and Ewing Lusk. “Data sieving and collective I/O in ROMIO.” Proceedings. Frontiers’ 99. Seventh Symposium on the Frontiers of Massively Parallel Computation (February, 1999):182-189.
Stepanov, Evgeniy, and Konstantin Serebryany. “MemorySanitizer: fast detector of uninitialized memory use in C++.” 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (February, 2015):46-55.
What are some other examples of parallel operations in your daily life? How would you classify your example? What does the parallel design appear to optimize for? Can you compute a parallel speedup for this example?
Answer: Examples of parallel operations in daily life include multi-lane highways, class registration queues, and mail delivery. There are many others.
For your desktop, laptop, or cellphone, what is the theoretical parallel processing power of your system in comparison to its serial processing power? What kinds of parallel hardware are present in it?
Answer: It can be hard to penetrate the marketing and hype and find the real specifications. Most devices, including handheld, have multi-core processors and at least an integrated graphics processor. Desktops and laptops have some vector capabilities except for very old hardware.
Which parallel strategies do you see in the store checkout example in figure 1.1? Are there some present parallel strategies that are not shown? How about in your examples from exercise 1?
Answer: Multiple instruction, multiple data (MIMD), distributed data, pipeline parallelism, and out-of-order execution with specialized queues.
You have an image-processing application that needs to process 1,000 images daily, which are 4 mebibytes (MiB, 2^20 or 1,048,576 bytes) each in size. It takes 10 min in serial to process each image. Your cluster is composed of multi-core nodes with 16 cores and a total of 16 gibibytes (GiB, 2^30 bytes, or 1,024 mebibytes) of main memory storage per node. (Note that we use the proper binary terms, MiB and GiB, rather than MB and GB, which are the metric terms for 10^6 and 10^9 bytes, respectively.)
Now customer demand increases by 10x. Does your design handle this? What changes would you have to make?
Answer: Threading on a single compute node along with vectorization. 4 MiB × 1,000 = 4 GiB. But to process 16 images at a time, only 64 MiB is needed, well under 1 GiB on each node (workstation) of the cluster. The time would be 10 min × 1,000, or about 167 hours in serial, and 10.4 hours on 16 cores in parallel. Vectorization could reduce this to under 5 hours. A demand increase of 10x would make this about 100 hours. This may be OK, but it might also be time to think about message passing or distributed computing.
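The sizing arithmetic can be double-checked with a few lines of shell, using the figures from the exercise:

```shell
# Check the exercise arithmetic with shell integer math
images=1000; mib_per_image=4; cores=16; min_per_image=10
batch_mib=$((mib_per_image * cores))    # memory for 16 in-flight images
serial_min=$((images * min_per_image))  # 10,000 min, about 167 hours
parallel_min=$((serial_min / cores))    # 625 min, about 10.4 hours
echo "batch=${batch_mib}MiB serial=${serial_min}min parallel=${parallel_min}min"
```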
An Intel Xeon E5-4660 processor has a thermal design power of 130 W; this is the average power consumption rate when all 16 cores are used. Nvidia’s Tesla V100 GPU and AMD’s MI25 Radeon GPU have a thermal design power of 300 W. Suppose you port your software to use one of these GPUs. How much faster should your application run on the GPU to be considered more energy efficient than your 16-core CPU application?
Answer: 300 W / 130 W. It needs to have a 2.3x speedup to be more energy efficient.
You have a wave height simulation application that you developed during graduate school. It is a serial application and because it was only planned to be the basis for your dissertation, you didn’t incorporate any software engineering techniques. Now you plan to use it as the starting point for an available tool that many researchers can use. You have three other developers on your team. What would you include in your project plan for this?
Answer: Because CTest detects any error status from a command, a test can be made directly from a build instruction. Sometimes installing the CTest files strips the executable bit from the permissions and causes the test to fail with no clear error message. To avoid this, we can add a test to detect whether the CTest script is executable. In the following code, $0 is the CTest script with the full path so that it works for out-of-tree builds.
enable_testing()
add_test(NAME make WORKING_DIRECTORY ${CMAKE_BINARY_DIR}
COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/build.ctest)
#!/bin/sh
if [ -x "$0" ]
then
   echo "PASSED - is executable"
else
   echo "FAILED - ctest script is not executable"
   exit 1
fi
Fix the memory errors in listing 2.2
Answer: You can fix the memory errors in listing 2.2 by changing or adding the following lines:
4 int ipos=0, ival;
7 for (int i = 0; i<10; i++){ iarray[i] = ipos; }
8 for (int i = 0; i<10; i++){
11 free(iarray);
Calculate the theoretical performance of a system of your choice. Include the peak flops, memory bandwidth, and machine balance in your calculation.
Download the Roofline Toolkit from https://bitbucket.org/berkeleylab/cs-roofline-toolkit.git and measure the actual performance of your selected system.
With the Roofline Toolkit, start with one processor and incrementally add optimization and parallelization, recording how much improvement you get at each step.
Download the STREAM benchmark from https://www.cs.virginia.edu/stream/ and measure the memory bandwidth of your selected system.
Pick one of the publicly available benchmarks or mini-apps listed in section 17.4 and generate a call graph using KCachegrind.
Pick one of the publicly available benchmarks or mini-apps listed in section 17.4 and measure its arithmetic intensity with either Intel Advisor or the likwid tools.
Using the performance tools presented in this chapter, determine the average processor frequency and energy consumption for a small application.
Using some of the tools from section 2.3.3, determine how much memory an application uses.
Write a 2D contiguous memory allocator for a lower-left triangular matrix.
Answer: Listing B.4.1 shows the code to allocate a lower-left triangular array. Assume array indexing is C, with the lower left element at [0][0]. Also, the matrix must be a square matrix. We use the same code as in listing 4.3 but with the length of imax reduced by 1 for each row. Note that the number of elements in the triangular array can be calculated by jmax*(imax+1)/2.
Listing B.4.1 Triangular matrix allocation
ExerciseB.4.1/malloc2Dtri.c
1 #include <stdlib.h>
2 #include "malloc2Dtri.h"
3
4 double **malloc2Dtri(int jmax, int imax)
5 {
6 double **x = ❶
(double **)malloc(jmax*sizeof(double *) + ❶
7 jmax*(imax+1)/2*sizeof(double)); ❶
8
9 x[0] = (double *)(x + jmax); ❷
10
11 for (int j = 1; j < jmax; j++, imax--) { ❸
12 x[j] = x[j-1] + imax; ❹
13 }
14
15 return(x);
16 }
❶ First allocate a block of memory for the row pointers and the 2D array
❷ Now assign the start of the block of memory for the 2D array after the row pointers
❸ Reduce imax by 1 each iteration
❹ Last, assign the memory location to point to for each row pointer
Write a 2D allocator for C that lays out the memory the same way as Fortran.
Answer: Let’s assume that we want to address the array as x(j,i) in Fortran. The array will be addressed as x[i][j] in C. If we create a macro #define x(j,i) x[i-1][j-1], then the code could use the Fortran array notation. The 2D memory allocator from listing 4.3 can be used by interchanging i and j and imax and jmax. The following listing shows the resulting code.
Listing B.4.2 2D array allocation with Fortran memory layout
Exercise4.2/malloc2Dfort.c
1 #include <stdlib.h>
2 #include "malloc2Dfort.h"
3
4 double **malloc2Dfort(int jmax, int imax)
5 {
6 double **x = ❶
(double **)malloc(imax*sizeof(double *) + ❶
7 imax*jmax*sizeof(double)); ❶
8
9 x[0] = (double *)(x + imax); ❷
10
11 for (int i = 1; i < imax; i++) {
12 x[i] = x[i-1] + jmax; ❸
13 }
14
15 return(x);
16 }
❶ First allocate a block of memory for the column pointers and the 2D array
❷ Now assign the start of the block of memory for the 2D array after the column pointers
❸ Last, assign the memory location to point to for each column pointer
Design a macro for an Array of Structure of Arrays (AoSoA) for the RGB color model in section 4.1.
Answer: We want to retrieve the data with the normal array index and color name:
#define VV 4
#define color(i,C) AOSOA[(i)/VV].C[(i)%VV]
color(50,B)
Modify the code for the cell-centric full matrix data structure to not use a conditional and estimate its performance.
Answer: The following figure shows the code with the if statement removed. From this modified code, the performance model counts look like the following:
Memops = 2 * Nc Nm + 2 * Nc = 102 M Memops
1: for all cells, C, up to Nc do
2:    ave ← 0.0
3:    for all material IDs, m, up to Nm do
4:       ave ← ave + ρ[C][m] ∗ f[C][m]    # 2Nc Nm loads (ρ, f)
5:    end for
6:    ρave[C] ← ave/V[C]    # Nc stores (ρave), Nc loads (V)
Performance Model = 61.0 ms. This performance estimate is slightly faster than the version with the if statement.
How would an AVX-512 vector unit change the ECM model for the stream triad?
Answer: The performance analysis with the ECM model in section 4.4 uses a 256-bit AVX vector unit that can process all the needed floating-point operations in 1 cycle. An AVX-512 unit would still need 1 cycle but would have only half of its vector lanes busy; it could do twice the work if there were more work to do. Because the compute operation time, TOL, remains at 1 cycle, the performance would not change at all.
A cloud collision model in an ash plume is invoked for particles within a 1 mm distance. Write pseudocode for a spatial hash implementation. What complexity order is this operation?
Answer: The pseudocode for the collision operation is as follows:
1. Bin particles into 1 mm spatial bins
2. For each bin
3. For each particle, i, in the bin
4. For all other particles, j, in this bin or adjacent bins
5. if |Pi - Pj| < 1 mm
6. compute collision
The operation is O(N²) in the local region, but as the mesh grows larger, the distance between the particles does not have to be computed for larger regions; thus, the operation approaches O(N).
How are spatial hashes used by the postal service?
Answer: Zip codes. The hashing function encodes the state and region in the first three digits with the remaining two encoding first the large towns and then alphabetical order for the rest.
Big data uses a map-reduce algorithm for efficient processing of large data sets. How is it different than the hashing concepts presented here?
Answer: Although developed for different problem domains and scales, the map operation in the map-reduce algorithm is a hash. So these both do a hashing step followed by a second local operation. The spatial hash has a concept of a distance relationship between bins, whereas the map-reduce intrinsically does not.
A wave simulation code uses an AMR mesh to better refine the shoreline. The simulation requirements are to record the wave heights versus time for specified locations where buoys and shore facilities are located. Because the cells are constantly being refined, how could you implement this?
Answer: Create a perfect spatial hash with the bin size the same as the smallest cell and store the cell index in the bins underlying the cell. Calculate the bin for each station and get the cell index from the bin.
Experiment with auto-vectorizing loops from the multimaterial code in section 4.3 (https://github.com/LANL/MultiMatTest.git). Add the vectorization and loop report flags and see what your compiler tells you.
Add OpenMP SIMD pragmas to the loop you selected in the first exercise to help the compiler vectorize it.
For one of the vector intrinsic examples, change the vector length from four double precision values to an eight-wide vector width. Check the source code for this chapter for examples of working code for eight-wide implementations.
Answer: In kahan_fog_vector.cpp, change 4s to 8s and change Vec4d to Vec8d. Add -mprefer-vector-width=512 -DMAX_VECTOR_SIZE=512 to CXXFLAGS. The changed code and Makefile are included in the source code for this chapter.
If you are on an older CPU, does your program from exercise 3 successfully run? What is the performance impact?
Answer: For Intel 256-bit vector units, the Intel intrinsics do not work and must be commented out. The GCC and Fog versions still work, however. The timing results from a 2017 Mac laptop show the superiority of Agner Fog’s vector class library with the eight-wide vectors producing better results than the four-wide. In contrast, the GCC implementation for the eight-wide vector is slower than the four-wide version. Here's the output:
SETTINGS INFO -- ncells 1073741824 log 30
Initializing mesh with Leblanc problem, high values first
relative diff runtime Description
8.423e-09 1.461642 Serial sum
0 3.283697 Kahan sum with double double accumulator
4 wide vectors serial sum
-3.356e-09 0.408654 Serial sum (OpenMP SIMD pragma)
-3.356e-09 0.407457 Intel vector intrinsics Serial sum
-3.356e-09 0.402928 GCC vector intrinsics Serial sum
-3.356e-09 0.406626 Fog C++ vector class Serial sum
4 wide vectors Kahan sum
0 0.872013 Intel Vector intrinsics Kahan sum
0 0.873640 GCC vector extensions Kahan sum
0 0.872774 Fog C++ vector class Kahan sum
8 wide vector serial sum
-1.986e-09 1.467707 8 wide GCC vector intrinsic Serial sum
-1.986e-09 0.586075 8 wide Fog C++ vector class Serial sum
8 wide vector Kahan sum
-1.388e-16 1.914804 8 wide GCC vector extensions Kahan sum
-1.388e-16 0.545128 8 wide Fog C++ vector class Kahan sum
-1.388e-16 0.687497 Agner C++ vector class Kahan sum
Convert the vector add example in listing 7.8 into a high-level OpenMP following the steps in section 7.2.2.
Answer: Converting to high-level OpenMP, we end up with the code shown in the following listing with just a single pragma to open the parallel region.
Listing B.7.1 High-level OpenMP
ExerciseB.7.1/vecadd.c
11 int main(int argc, char *argv[]){
12 #pragma omp parallel
13 {
14       double time_sum = 0.0;
15 struct timespec tstart;
16 int thread_id = omp_get_thread_num();
17 int nthreads = omp_get_num_threads();
18 if (thread_id == 0){
19 printf("Running with %d thread(s)\n",nthreads);
20 }
21 int tbegin = ARRAY_SIZE * ( thread_id ) / nthreads;
22 int tend = ARRAY_SIZE * ( thread_id + 1 ) / nthreads;
23
24 for (int i=tbegin; i<tend; i++) {
25 a[i] = 1.0;
26 b[i] = 2.0;
27 }
28
29 if (thread_id == 0) cpu_timer_start(&tstart);
30 vector_add(c, a, b, ARRAY_SIZE);
31 if (thread_id == 0) {
32 time_sum += cpu_timer_stop(tstart);
33 printf("Runtime is %lf msecs\n", time_sum);
34 }
35 }
36 }
37
38 void vector_add(double *c, double *a, double *b, int n)
39 {
40 int thread_id = omp_get_thread_num();
41 int nthreads = omp_get_num_threads();
42 int tbegin = n * ( thread_id ) / nthreads;
43 int tend = n * ( thread_id + 1 ) / nthreads;
44 for (int i=tbegin; i < tend; i++){
45 c[i] = a[i] + b[i];
46 }
47 }
Write a routine to get the maximum value in an array. Add an OpenMP pragma to add thread parallelism to the routine
Answer: The reduction routine uses the reduction(max:xmax) clause as the following listing shows.
Listing B.7.2 OpenMP max reduction
ExerciseB.7.2/max_reduction.c
1 #include <float.h>
2 double array_max(double* restrict var, int ncells)
3 {
4 double xmax = DBL_MIN;
5 #pragma omp parallel for reduction(max:xmax)
6 for (int i = 0; i < ncells; i++){
7 if (var[i] > xmax) xmax = var[i];
8    }
9    return(xmax);
10 }
Write a high-level OpenMP version of the reduction in the previous exercise.
Answer: In high-level OpenMP, we manually divide up the data. The data decomposition is done in lines 6-9 in listing B.7.3. Thread 0 allocates the xmax_thread shared data array on line 13. Lines 18-22 find the maximum value for each thread and store the result in the xmax_thread array. Then, on lines 26-30, one thread finds the maximum across all the threads.
Listing B.7.3 High-level OpenMP
ExerciseB.7.3/max_reduction.c
1 #include <stdlib.h>
2 #include <float.h>
3 #include <omp.h>
4 double array_max(double* restrict var, int ncells)
5 {
6 int nthreads = omp_get_num_threads();
7 int thread_id = omp_get_thread_num();
8 int tbegin = ncells * ( thread_id ) / nthreads;
9 int tend = ncells * ( thread_id + 1 ) / nthreads;
10 static double xmax;
11 static double *xmax_thread;
12 if (thread_id == 0){
13 xmax_thread = malloc(nthreads*sizeof(double));
14 xmax = DBL_MIN;
15 }
16 #pragma omp barrier
17
18 double xmax_thread_private = DBL_MIN;
19 for (int i = tbegin; i < tend; i++){
20 if (var[i] > xmax_thread_private) xmax_thread_private = var[i];
21 }
22 xmax_thread[thread_id] = xmax_thread_private;
23
24 #pragma omp barrier
25
26 if (thread_id == 0){
27 for (int tid=0; tid < nthreads; tid++){
28 if (xmax_thread[tid] > xmax) xmax = xmax_thread[tid];
29 }
30 }
31
32 #pragma omp barrier
33
34 if (thread_id == 0){
35 free(xmax_thread);
36 }
37 return(xmax);
38 }
Why can’t we just block on receives as was done in the send/receive in the ghost exchange using the pack or array buffer methods in listings 8.20 and 8.21, respectively?
Answer: The versions using the pack or array buffers schedule the send, but return before the data is copied or sent. The MPI standard for MPI_Isend says, “The sender should not modify any part of the send buffer after a nonblocking send operation is called, until the send completes.” The pack and array versions deallocate the buffers after the communication. So these versions might delete the buffers before the data is copied, causing the program to crash. To be safe, the status of the send must be checked before the buffer is deleted.
Is it safe to block on receives as shown in listing 8.8 in the vector type version of the ghost exchange? What are the advantages if we only block on receives?
Answer: The vector version sends the data from the original arrays instead of making a copy. This is safer than the versions that allocate a buffer, which will be deallocated. If we only block on receives, the communication can be faster.
Modify the ghost cell exchange vector type example in listing 8.21 to use blocking receives instead of a waitall. Is it faster? Does it always work?
Answer: Even with the vector version of the ghost cell exchange, we have to be careful that we do not modify the buffers that are still in the process of being sent. The odds of this happening can be small when we are not sending corners. But it still can occur. To be absolutely safe, we need to check for completion of the sends before changing the arrays.
Try replacing the explicit tags in one of the ghost exchange routines with MPI_ANY_TAG. Does it work? Is it any faster? What advantage do you see in using explicit tags?
Answer: Using MPI_ANY_TAG for the tag argument works fine. It can be slightly faster though it is unlikely that it will be significant enough to be measurable. Using explicit tags adds another check that the right message is being received.
Remove the barriers in the synchronized timers in one of the ghost exchange examples. Run the code with the original synchronized timers and the unsynchronized timers.
Answer: Removing the barriers in the timers should give better performance and allow the processes to operate more independently (asynchronous). It can be more difficult to understand the timing measurements though.
Add the timer statistics from listing 8.11 to the stream triad bandwidth measurement code in listing 8.17.
Apply the steps to convert high-level OpenMP to the hybrid MPI plus OpenMP example in the code that accompanies the chapter (HybridMPIPlusOpenMP directory). Experiment with the vectorization, number of threads, and MPI ranks on your platform.
Table 9.7 shows the achievable performance for a 1 flop/load application. Look up the current prices for the GPUs available on the market and fill in the last two columns to get the flops per dollar for each GPU. Which looks like the best value? If turnaround time for your application runtime is the most important criterion, which GPU would be best to purchase?
Table B.1 Achievable performance for a 1 flop/load application with various GPUs
Measure the stream bandwidth of your GPU or another selected GPU. How does it compare to the ones presented in the chapter?
Use the likwid performance tool to get the CPU power requirements for the CloverLeaf application on a system where you have access to the power hardware counters.
You have an image classification application that will take 5 ms to transfer each file to the GPU, 5 ms to process and 5 ms to bring back. On the CPU, the processing takes 100 ms per image. There are one million images to process. You have 16 processing cores on the CPU. Would a GPU system do the work faster?
Time on a CPU—100 ms × 1,000,000/16 /1,000 = 6,250 s
Time on a GPU—(5 ms + 5 ms + 5 ms) × 1,000,000/1,000 = 15,000 s
The GPU system would not be faster. It would take about 2.5 times as long.
The transfer time for the GPU in problem 1 is based on a third generation PCI bus. If you can get a Gen4 PCI bus, how does that change the design? A Gen 5 PCI bus? For image classification, you shouldn’t need to bring back a modified image. How does that change the calculation?
Answer: A fourth-generation PCI bus is twice as fast as a third-generation PCI bus.
(2.5 ms + 5 ms + 2.5 ms) × 1,000,000/1,000 = 10,000 s
A fifth-generation PCI bus would be four times as fast as the original third-generation PCI bus.
(1.25 ms + 5 ms + 1.25 ms) × 1,000,000/1,000 = 7,500 s
If we don’t have to transfer the results back, we are now just as fast on the GPU as on the CPU.
For your discrete GPU (or NVIDIA GeForce GTX 1060, if none), what size 3D application could you run? Assume 4 double-precision variables per cell and a usage limit of half the GPU memory so you have room for temporary arrays. How does this change if you use single precision?
Answer: An NVIDIA GeForce GTX 1060 has a memory size of 6 GiB. It has GDDR5 with a 192-bit wide bus and 8GHz memory clock.
(6 GiB / 2 / 4 doubles / 8 bytes × 1024³)^(1/3) = 465 × 465 × 465 3D mesh
(6 GiB / 2 / 4 floats / 4 bytes × 1024³)^(1/3) = 586 × 586 × 586 3D mesh
If we are dividing up our computational domain into this 3D mesh, this is a 25% improvement in resolution.
Find what compilers are available for your local GPU system. Are both OpenACC and OpenMP compilers available? If not, do you have access to any systems that would allow you to try out these pragma-based languages?
Run the stream triad examples from the OpenACC/StreamTriad and/or the OpenMP/StreamTriad directories on your local GPU development system. You’ll find these directories at https://github.com/EssentialsofParallelComputing/Chapter11.
Compare your results from exercise 2 to BabelStream results at https://uob-hpc.github.io/BabelStream/results/. For the stream triad, the bytes moved are 3 * nsize * sizeof(datatype).
Answer: From the performance results in the chapter for the NVIDIA V100 GPU
(3 × 20,000,000 × 8 bytes / 0.586 ms) × (1,000 ms/s) / (1,000,000,000 bytes/GB) = 819 GB/s
This is about 50% greater than the peak shown for the BabelStream benchmark for the NVIDIA P100 GPU.
Modify the OpenMP data region mapping in listing 11.16 to reflect the actual use of the arrays in the kernels.
Answer: The arrays are only used on the GPU, so these can be allocated there and deleted at the end. Therefore, the changes are
13 #pragma omp target enter data map(alloc:a[0:nsize], b[0:nsize], c[0:nsize])
36 #pragma omp target exit data map(delete:a[0:nsize], b[0:nsize], c[0:nsize])
The full listing of this change is in Stream_par7.c in the examples for the chapter.
Implement the mass sum example from listing 11.4 in OpenMP.
Answer: We just need to change the one pragma as the following listing shows.
Listing B.11.5 GPU version of OpenMP
ExerciseB.11.5/mass_sum.c
1 #include "mass_sum.h"
2 #define REAL_CELL 1
3
4 double mass_sum(int ncells, int* restrict celltype,
5 double* restrict H, double* restrict dx, double* restrict dy){
6 double summer = 0.0;
7 #pragma omp target teams distribute \
parallel for simd reduction(+:summer)
8 for (int ic=0; ic<ncells ; ic++) {
9 if (celltype[ic] == REAL_CELL) {
10 summer += H[ic]*dx[ic]*dy[ic];
11 }
12 }
13 return(summer);
14 }
For x and y arrays of size 20,000,000, find the maximum radius for the arrays using both OpenMP and OpenACC. Initialize the arrays with double-precision values that linearly increase from 1.0 to 2.0e7 for the x array and decrease from 2.0e7 to 1.0 for the y array.
Answer: The following listing shows a possible implementation of finding the maximum radius using OpenACC.
Listing B.11.6 OpenACC version of Max Radius
ExerciseB.11.6/MaxRadius.c or Chapter11/OpenACC/MaxRadius/MaxRadius.c
1 #include <stdio.h>
2 #include <math.h>
3 #include <openacc.h>
4
5 int main(int argc, char *argv[]){
6 int ncells = 20000000;
7 double* restrict x = acc_malloc(ncells * sizeof(double));
8 double* restrict y = acc_malloc(ncells * sizeof(double));
9
10 double MaxRadius = -1.0e30;
11 #pragma acc parallel deviceptr(x, y)
12 {
13 #pragma acc loop
14 for (int ic=0; ic<ncells; ic++) {
15 x[ic] = (double)(ic+1);
16 y[ic] = (double)(ncells-ic);
17 }
18
19 #pragma acc loop reduction(max:MaxRadius)
20 for (int ic=0; ic<ncells ; ic++) {
21 double radius = sqrt(x[ic]*x[ic] + y[ic]*y[ic]);
22 if (radius > MaxRadius) MaxRadius = radius;
23 }
24 }
25 printf("Maximum Radius is %lf\n",MaxRadius);
26
27 acc_free(x);
28 acc_free(y);
29 }
Change the host memory allocation in the CUDA stream triad example to use pinned memory (listings 12.1-12.6). Did you get a performance improvement?
Answer: To get pinned memory, replace malloc in the host-side memory allocation with cudaMallocHost and replace free with cudaFreeHost, as listing B.12.1 shows. The listing displays only the lines that need to change. Compare the performance to the code in the Chapter12/CUDA/StreamTriad directory. The data transfer time should be at least a factor of two faster with pinned memory.
Listing B.12.1 Pinned memory version of stream triad
ExerciseB.12.1/StreamTriad.cu
31 // allocate host memory and initialize
32 double *a, *b, *c;
33 cudaMallocHost(&a,stream_array_size*sizeof(double));
34 cudaMallocHost(&b,stream_array_size*sizeof(double));
35 cudaMallocHost(&c,stream_array_size*sizeof(double));
< ... stream triad code ... >
86 cudaFreeHost(a);
87 cudaFreeHost(b);
88 cudaFreeHost(c);
For the sum reduction example, try an array size of 18,000 elements all initialized to their index value. Run the CUDA code and then the version in SumReductionRevealed. You may want to adjust the amount of information printed.
For the SYCL example in listing 12.20, initialize the a and b arrays on the GPU device.
Answer: The following listing shows a version with the a and b arrays initialized on the GPU.
Listing B.12.4 Initializing arrays a and b in SYCL
14 // host data
15 vector<double> a(nsize);
16 vector<double> b(nsize);
17 vector<double> c(nsize);
18
19 t1 = chrono::high_resolution_clock::now();
20
21 sycl::queue Queue(sycl::cpu_selector{});
22
23 const double scalar = 3.0;
24
25 sycl::buffer<double,1> dev_a { a.data(), sycl::range<1>(a.size()) };
26 sycl::buffer<double,1> dev_b { b.data(), sycl::range<1>(b.size()) };
27 sycl::buffer<double,1> dev_c { c.data(), sycl::range<1>(c.size()) };
28
29 Queue.submit([&](sycl::handler& CommandGroup) {
30
31 auto a =
dev_a.get_access<sycl::access::mode::write>(CommandGroup);
32 auto b =
dev_b.get_access<sycl::access::mode::write>(CommandGroup);
33 auto c =
dev_c.get_access<sycl::access::mode::write>(CommandGroup);
34
35 CommandGroup.parallel_for<class StreamTriadInit>(
sycl::range<1>{nsize}, [=] (sycl::id<1> it) {
36 a[it] = 1.0;
37 b[it] = 2.0;
38 c[it] = -1.0;
39 });
40 });
41 Queue.wait();
42
43 Queue.submit([&](sycl::handler& CommandGroup) {
44
45 auto a = dev_a.get_access<sycl::access::mode::read>(CommandGroup);
46 auto b = dev_b.get_access<sycl::access::mode::read>(CommandGroup);
47 auto c =
dev_c.get_access<sycl::access::mode::write>(CommandGroup);
48
49 CommandGroup.parallel_for<class StreamTriad>(
sycl::range<1>{nsize}, [=] (sycl::id<1> it) {
50 c[it] = a[it] + scalar * b[it];
51 });
52 });
53 Queue.wait();
54
55 t2 = chrono::high_resolution_clock::now();
Convert the two initialization loops in the Raja example in listing 12.24 to the RAJA::forall syntax. Try running the example with CUDA.
Answer: The initialization loop needs the changes shown in the following listing. Then the stream triad code is built and run the same way as in section 12.5.2.
Listing B.12.5 Adding Raja to the initialization loop of stream triad
ExerciseB.12.5/StreamTriad.cc
19 RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0,
nsize), [=] (int i) {
20 a[i] = 1.0;
21 b[i] = 2.0;
22 });
With these changes, the run time compared to the original version in section 12.5.2 drops from around 6.59 ms to 1.67 ms.
Run nvprof on the STREAM Triad example. You might try the CUDA version from chapter 12 or the OpenACC version from chapter 11. What workflow did you use for your hardware resources? If you don’t have access to an NVIDIA GPU, can you use another profiling tool?
Generate a trace from nvprof and import it into NVVP. Where is the run time spent? What could you do to optimize it?
Download a prebuilt Docker container from the appropriate vendor for your system. Start up the container and run one of the examples from chapter 11 or 12.
Generate a visual image of a couple of different hardware architectures. Discover the hardware characteristics for these devices.
Answer: Use the lstopo tool to generate an image of your architecture.
For your hardware, run the test suite using the script in listing 14.4. What do you discover about how to best use your system?
Change the program used in the vector addition (vecadd_opt3.c) example in section 14.3 to include more floating-point operations. Take the kernel and change the operations in the loop to the Pythagorean formula:
c[i] = sqrt(a[i] * a[i] + b[i] * b[i]);
How do your results and conclusions about the best placement and bindings change? Do you see benefit from hyperthreads now (if you have those)?
For the MPI example in section 14.4, include the vector add kernel and generate a scaling graph for the kernel. Then replace the kernel with the Pythagorean formula used in exercise 3.
Combine the vector add and Pythagorean formula in the following routine (either in a single loop or two separate loops) to get more data reuse:
c[i] = a[i] + b[i]; d[i] = sqrt(a[i]*a[i] + b[i]*b[i]);
How does this change the results of the placement and binding study?
Add code to set the placement and affinity within the application from one of the previous exercises.
Try submitting a couple of jobs, one with 32 processors and one with 16 processors. Check to see that these are submitted and whether they are running. Delete the 32 processor job. Check to see that it got deleted.
Modify the automatic restart script so that the first job is a preprocessing step to set up for the computation and the restarts are for running the simulation.
Answer: To insert a preprocessing step, we need to insert another conditional case as the following listing shows on lines 31-36 and then use the PREPROCESS_DONE file to indicate that the preprocessing has been done.
Listing B.15.2a Inserting preprocessing step and then automatically restarting
ExerciseB.15.2/Preprocess_then_restart.sh
1 #!/bin/sh
2 #SBATCH -N 1
3 #SBATCH -n 4
4 #SBATCH --signal=23@160
5 #SBATCH -t 00:08:00
6
7 # Do not place bash commands before the last SBATCH directive
8 # Behavior can be unreliable
9
10 NUM_CPUS=4
11 OUTPUT_FILE=run.out
12 EXEC_NAME=./testapp
13 MAX_RESTARTS=4
14
15 if [ -z ${COUNT} ]; then
16 export COUNT=0
17 fi
18
19 ((COUNT++))
20 echo "Restart COUNT is ${COUNT}"
21
22 if [ ! -e DONE ]; then
23 if [ -e RESTART ]; then
24 echo "=== Restarting ${EXEC_NAME} ===" >> ${OUTPUT_FILE}
25 cycle=`cat RESTART`
26 rm -f RESTART
27 elif [ -e PREPROCESS_DONE ]; then
28 echo "=== Starting problem ===" >> ${OUTPUT_FILE}
29 cycle=""
30 else
31 echo "=== Preprocessing data for problem ===" >> ${OUTPUT_FILE}
32 mpirun -n ${NUM_CPUS} ./preprocess_data &>> ${OUTPUT_FILE}
33 date > PREPROCESS_DONE
34 sbatch \ ❶
--dependency=afterok:${SLURM_JOB_ID} \ ❶
<preprocess_then_restart.sh ❶
35 exit
36 fi
37
38 echo "=== Submitting restart script ===" >> ${OUTPUT_FILE}
39 sbatch \ ❷
--dependency=afterok:${SLURM_JOB_ID} \ ❷
<preprocess_then_restart.sh ❷
40
41 mpirun -n ${NUM_CPUS} ${EXEC_NAME} ${cycle} &>> ${OUTPUT_FILE}
42 echo "Finished mpirun" >> ${OUTPUT_FILE}
43
44 if [ ${COUNT} -ge ${MAX_RESTARTS} ]; then
45 echo "=== Reached maximum number of restarts ===" >> ${OUTPUT_FILE}
46 date > DONE
47 fi
48 fi
❶ Submits the first calculation job after preprocessing
❷ Submits the next restart job before the application runs
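The control flow of the script depends only on the marker files DONE, RESTART, and PREPROCESS_DONE, so the branch logic can be exercised locally without Slurm. A standalone sketch (decide_phase is our own helper name, not part of the book's scripts):

```shell
#!/bin/sh
# Reproduce the script's branch logic as a function that reports which
# phase would run, based on the marker files in the current directory.
decide_phase() {
   if [ -e DONE ]; then
      echo "done"
   elif [ -e RESTART ]; then
      echo "restart"
   elif [ -e PREPROCESS_DONE ]; then
      echo "start"
   else
      echo "preprocess"
   fi
}

cd "$(mktemp -d)"      # empty scratch directory: no marker files yet
decide_phase           # preprocess
date > PREPROCESS_DONE
decide_phase           # start
echo 100 > RESTART
decide_phase           # restart
rm -f RESTART
date > DONE
decide_phase           # done
```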
Often the preprocessing step needs a different number of processors. In this case, we can use a separate batch script for the preprocessing, shown in the following listing.
Listing B.15.2b Smaller preprocessing step and then automatic restart
ExerciseB.15.2/Preprocess_batch.sh
1 #!/bin/sh
2 #SBATCH -N 1
3 #SBATCH -n 1
5 #SBATCH -t 01:00:00
6
7 sbatch --dependency=afterok:${SLURM_JOB_ID} <batch_restart.sh
9
10 mpirun -n 4 ./preprocess &> preprocess.out
Modify the simple batch script in listing 15.1 for Slurm and 15.2 for PBS to clean up on failure by removing a file called simulation_database.
Answer: Change the Slurm batch script to check the status of the command and remove the simulation database. There are several different ways to do the cleanup. Here are three. The first two in listings B.15.3a and b use the exit code from the mpirun command.
Listing B.15.3a Cleanup on failure using the || operator
ExerciseB.15.3/batch_simple_error.sh
1 #!/bin/sh
2 #SBATCH -N 1
3 #SBATCH -n 4
5 #SBATCH -t 01:00:00
6
7 mpirun -n 4 ./testapp &> run.out || \ ❶
rm -f simulation_database ❶
❶ The || symbol executes the command for non-zero status values
Listing B.15.3b Cleanup on failure by checking the exit status
ExerciseB.15.3/batch_simple_error.sh
1 #!/bin/sh
2 #SBATCH -N 1
3 #SBATCH -n 4
5 #SBATCH -t 01:00:00
6
7 mpirun -n 4 ./testapp &> run.out
8 STATUS=$?
9 if [ ${STATUS} != "0" ]; then
10 rm -f simulation_database
11 fi
The third version, in listing B.15.3c, uses the status condition of the batch job through a dependency flag to invoke a cleanup job. The types of errors that are handled differ from those of the first two methods.
Listing B.15.3c Cleanup on failure using a batch job dependency
ExerciseB.15.3/batch.sh
1 #!/bin/sh
2 #SBATCH -N 1
3 #SBATCH -n 4
5 #SBATCH -t 01:00:00
6
7 sbatch --dependency=afternotok:${SLURM_JOB_ID} <batch_cleanup.sh
9
10 mpirun -n 4 ./testapp &> run.out
ExerciseB.15.3/batch_cleanup.sh
1 #!/bin/sh
2 #SBATCH -N 1
3 #SBATCH -n 1
5 #SBATCH -t 00:10:00
6 rm -f simulation_database
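All three variants hinge on standard shell exit-status handling, which can be tried interactively with a dummy failing command (here false stands in for a failed mpirun):

```shell
#!/bin/sh
# Demonstrate the two in-script cleanup idioms with a failing command.
cd "$(mktemp -d)"

# Idiom from listing B.15.3a: || runs the cleanup only on failure
date > simulation_database
false || rm -f simulation_database
[ ! -e simulation_database ] && echo "|| idiom cleaned up"

# Idiom from listing B.15.3b: capture $? and test it explicitly
date > simulation_database
false
STATUS=$?
if [ "${STATUS}" != "0" ]; then
   rm -f simulation_database
fi
[ ! -e simulation_database ] && echo "status idiom cleaned up"
```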
Check for the hints available on your system using the techniques described in section 16.6.1.
Try the MPI-IO and HDF5 examples on your system with much larger datasets to see what performance you can achieve. Compare that to the IOR micro benchmark for extra credit.
Use the h5ls and h5dump utilities to explore the HDF5 data file created by the example.
Run the Dr. Memory tool on one of your small codes or one of the codes from the exercises in this book.
Compile one of your codes with the dmalloc library. Run your code and view the results.
Try inserting a thread race condition into the example code in section 17.6.2 and see how Archer reports the problem.
Try the profiling exercise in section 17.8 on your filesystem. If you have more than one filesystem, try it on each one. Then change the size of the array in the example to 2000x2000. How does it change the filesystem performance results?
3DNow! An AMD vector instruction set that first supported single-precision operations.
Affinity Assigning a preference for the placement of a process, rank, or thread to a particular hardware component. This is also called pinning or binding.
Algorithmic complexity A measure of the number of operations that it would take to complete an algorithm. Algorithmic complexity is a property of the algorithm and is a measure of the amount of work or operations in a procedure.
Aliasing Where pointers point to overlapping regions of memory. In this situation, the compiler cannot tell if it is the same memory, and in these instances, it would be unsafe to generate vectorized code or other optimizations.
Anti-flow dependency A variable within the loop is written after being read, known as a write-after-read (WAR).
Arithmetic intensity The number of floating-point operations (flops) relative to the memory loads (data) that your application or kernel (loop) performs. The arithmetic intensity is an important measure to understand the limiting characteristics of an application.
Asymptotic notation An expression that specifies the limiting bound on performance. Basically, does the run time grow linearly or worse with the size of a problem? The notation uses various forms of O, such as O(n), O(n log₂ n), or O(n²). The O can be thought of as “order” as in “scales on an order of.”
Asynchronous This call is non-blocking and only initiates an operation.
Auto-vectorization The vectorization of the source code by the compiler for standard C, C++, or Fortran language source code.
AVX Advanced Vector Extensions (AVX) is a 256-bit vector hardware unit and instruction set.
AVX2 An improvement to AVX hardware to support fused multiply adds (FMA).
AVX512 Extends the AVX hardware to 512-bit vector widths.
Bandwidth The best rate at which data can be moved through a given path in the system. This can refer to memory, disk, or network throughput.
Binary data format The machine representation of the data that is used by the processor and stored in main memory. Usually this term refers to the data format staying in binary form when it is written out to the hard disk.
Blocking An operation that does not complete until a specific condition is fulfilled.
Branch miss The cost encountered when the predicted branch in an if statement is incorrect.
Bucket A storage location holding a collection of values. Hashing techniques are used to store the values for keys in a bucket because there might be multiple values for that location.
Cache A faster block of memory that is used to reduce the cost of accessing the slower main memory by storing blocks of data or instructions that might be needed.
Cache eviction The removal of blocks of data, called cache lines, from one of the various levels of the cache hierarchy.
Cache line The block of data loaded into cache when memory is accessed.
Cache misses Occur when the processor tries to access a memory address and it is not in the cache. The system then has to retrieve the data from main memory at a cost of hundreds of cycles.
Cache thrashing A condition where one memory load evicts another and then the original data is needed again, causing loading, eviction, and reloading of data.
Cache update storms On a multiprocessor system, when one processor modifies data that is in another processor's cache, the data has to be reloaded on those other processors.
Call stack The list of called subroutines that has to be unwound by a return at the end of the subroutine where it jumps back to the previous calling routine.
Capacity misses The misses that are caused by the limited size of the cache.
Catastrophic cancellation The subtraction of two almost equal numbers, causing the result to have only a few significant digits.
Centralized version control system A version control system implemented as a single centralized system.
Checkpoint/Restart The periodic writing out of the state of an application followed by the starting up of the application in a later job.
Checkpointing The practice of periodically storing the state of a calculation to disk so that the calculation can be restarted due to system failures or finite length run times in a batch system. See checkpoint/restart.
Clock cycle The small intervals of time between operations in the computer based on the clock frequency of the system.
Cluster A small group of distributed memory nodes connected by a commodity network.
Coalesced memory loads The combination of separate memory loads from groups of threads into a single cache-line load.
Coarse-grained parallelism A type of parallelism where the processor operates on large blocks of code with infrequent synchronization.
Code coverage A metric of how many lines of the source code are executed and, therefore, “covered” by running a test suite. It is usually expressed as a percentage of the source lines of code.
Coherency misses Cache updates needed to synchronize the caches between multiprocessors when data is written to one processor’s cache that is also held in another processor’s cache.
Cold cache A cache that does not have any of the data to be operated on in cache from a previous operation when the current operation begins.
Collisions (hash) When more than one key wants to store its value in the same bucket.
Commit tests A test suite that is run prior to committing any code to the repository.
Compact hash A hash that is compressed into a smaller memory size. A compact hash must have a way to handle collisions.
Comparative speedups Short for comparative performance speedups between architectures. This is the relative performance between two hardware architectures, often based on a single node or a fixed power envelope.
Compressed sparse data structures A space-efficient way to represent a data space that is sparse. The most notable example is the Compressed Sparse Row (CSR) format used for sparse matrices.
Compulsory misses Cache misses that are necessary to bring in the data when it is first encountered.
Computational complexity The number of steps needed to complete an algorithm. This complexity measure is an attribute of the implementation and the type of hardware that is being used for the calculation.
Computational kernel A section of the application that is both computationally intensive and conceptually self-contained.
Computational mesh A collection of cells or elements that covers the simulation region.
Compute device (OpenCL) Any computational hardware that can perform computation and supports OpenCL is a compute device. This can include GPUs, CPUs, or even more exotic hardware such as embedded processors or FPGAs.
Concurrency The operation of parts of a program in any order with the same result. Concurrency was originally developed to support concurrent computing or timesharing by interleaving computing on a limited set of resources.
Conflict misses (cache) Misses caused by the loading of another block of memory into a cache line that is still needed by the CPU.
Contiguous memory Memory that is composed of an uninterrupted sequence of bytes.
Continuous integration An automatic testing process that is invoked with every commit to the repository.
Core Core or computational core is the basic element of the system that does the mathematical and logical operations.
CPU The discrete processing device (the central processing unit) composed of one or more computational cores that is placed on the socket of a circuit board to provide the main computational operations.
Data parallel A type of parallelism where the data is partitioned among the processors or threads and operated on in parallel.
Dedicated GPU A GPU on a separate peripheral card. Also known as a discrete GPU.
Dereferencing An operation where the memory address is obtained from the pointer reference so that the cache line is for the memory data instead of for the pointer.
Descriptive directives and clauses These directives give the compiler information about the following loop construct and give the compiler some freedom to generate the most efficient implementation.
Direct-mapped cache A cache for which a memory address has only one location in the cache where it can be loaded. This can lead to conflicts and evictions if another block of memory also maps to this location. See N-way set associative cache for a type of cache that avoids this problem.
Directive An instruction to a Fortran compiler to help it interpret the source code. The form of the instruction is a comment line starting with !$.
Discretization The process of breaking up a computational domain into smaller cells or elements, forming a computational mesh. Calculations are then performed on each cell or element.
Distributed array An array that is partitioned and split across the processors. For example, an array containing 100 values might be divided up across four processors with 25 values on each processor.
Distributed computing Applications and loosely coupled workflows that span multiple computers and use communication across the network to coordinate the work. Examples of distributed computing applications include searches via browsers on the internet and multiple clients interacting with a database on a server.
Distributed memory More than one block of memory, each existing in its own address space and control.
Distributed version control system A version control system that allows multiple repository databases rather than a single centralized system.
Domain-boundary halos Halo cells used for imposing a specific set of boundary conditions
Dope vector The metadata for an array in Fortran composed of the start, stride, and length for each dimension. The meaning is from the slang “give me the dope on” or information on someone or something.
DRAM Dynamic Random Access Memory. This memory needs to have its state refreshed frequently and the data it stores is lost when the power is turned off.
Dynamic range The range of the working set of real numbers in a problem.
Fine-grained parallelism A type of parallelism where computational loops or other small blocks of code are operated on by multiple processors or threads and may need frequent synchronization.
First touch The first touch of an array causes the memory to be allocated. It is allocated near to the thread location where the touch occurs. Prior to the first touch, the memory only exists as an entry in virtual memory. The physical memory that corresponds to the virtual memory is created when it is first accessed.
Flow dependency A variable within the loop is read after being written, known as a read-after-write (RAW).
FLOPs Floating-point operations such as addition, subtraction, and multiplication on single- or double-precision data types.
Flynn’s Taxonomy A categorization of computer architectures based on whether the data and instructions are either single or multiple.
Gather memory operation Memory loaded into a cache line or vector unit from non-contiguous memory locations.
Generation (PCIe) The PCI Special Interest Group (PCI SIG) is a group representing industry partners that establishes a PCI Express Specification, commonly referred to as generation or gen for short.
Ghost cells A set of cells that contain adjacent processor(s) data for use on the local processor so that the processor can operate in large blocks without issuing communication calls.
Global sum issue The difference in a global sum in a parallel calculation compared to a serial or run on a different number of processors.
GNU Compiler Collection (GCC) An open-source, publicly available compiler suite, including C, C++, Fortran, and many other languages.
GNU's Not Unix (GNU) A free, Unix-like operating system.
Graphical user interface (GUI) An interface composed of visual elements and interactive components that can be manipulated with a mouse or other advanced input devices.
Graphics processing unit (GPU) or general-purpose graphics processing unit (GPGPU), integrated or discrete (external) A device whose primary purpose is drawing graphics to the computer monitor. It is composed of many streaming multiprocessors and its own RAM memory, capable of executing tens of thousands of threads in one clock cycle.
HAL A small rogue computer that precedes IBM in lexicographic order. HAL is a fictional computer in Arthur C. Clarke's 2001: A Space Odyssey. HAL goes rogue because it interprets its instructions differently than intended, with deadly consequences. HAL is just one letter off from IBM. HAL’s lesson is to be careful with your programming; you never know what the results might be.
Halo cells Any set of cells surrounding a computational mesh domain.
Hang When one or more processors is waiting on an event that can never occur.
Hash or hashing A computer data structure that maps a key to a value.
Hash load factor The number of filled buckets divided by the total number of buckets in the hash.
Hash sparsity The amount of empty space in a hash.
Heap A region of memory for the program that is used to provide dynamic memory for the program. The malloc routines and the new operator get memory from this region. The second region of memory is stack memory.
High Performance Computing (HPC) Computing that focuses on extreme performance. The computing hardware is generally more tightly coupled. The term High Performance Computing has mostly replaced the older nomenclature of supercomputing.
Hyperthreading An Intel technology that makes a single processor appear to be two virtual processors to the operating system through sharing of hardware resources between two threads.
Inline (routines) Rather than make a function call, compilers insert the code at the call point to avoid call overhead. This only works for smaller routines and for simpler code.
Interconnects The connections between compute nodes, also called a network. Generally the term refers to higher performance networks that tightly couple the operations on a parallel computing system. Many of these interconnects are vendor proprietary and include specialized topologies such as fat-tree, switches, torus, and dragonfly designs.
Inter-process communication (IPC) Communication between processes on a computer node. The various techniques to communicate between processes form the backbone of client/server mechanisms in distributed computing.
Instruction cache The storage of instructions in fast memory close to the processor. Instructions can be for memory movement, or integer or floating-point operations. The data that is operated on has its own separate data cache.
Integrated GPU A graphics processor engine that is contained on the CPU.
Lambda expressions An unnamed, local function that can be assigned to a variable and used locally or passed to a routine.
Lanes (vector lanes) Pathways for data in a vector operation. For a 256-bit vector unit operating on double-precision values, there are four lanes allowing four simultaneous operations with one instruction in one clock cycle.
Latency The time required for the first byte or word of data to be transferred (see also memory latency).
Load factor (hash) The fraction of a hash that is filled with entries.
Machine balance The ratio of flops to memory loads that a computer system can perform.
Main memory Also called DRAM or RAM, it is the large block of memory for the compute node.
Memory latency The time it takes to retrieve the first byte of memory from a level of the memory hierarchy.
Memory leaks Allocating memory and never freeing it. Malloc replacement tools are good at catching and reporting memory leaks.
Memory overwrites Writing to memory that is not owned by a variable in the program.
Memory paging In multi-user, multi-application operating systems, the process of moving memory pages temporarily out to disk so that the memory can be used by another process.
Memory pressure The effect of the computational kernel resource needs on performance of GPU kernels. Register pressure is a similar term, referring to demands on registers in the kernel.
Method invocation In object-oriented programming, the call to a piece of code within the object that operates on data in the object. These small pieces of code are called methods and the call to these is termed an invocation.
MIMD Multiple instruction, multiple data is a component of Flynn’s Taxonomy represented by a multi-core system.
Minimal perfect hash A hash with one and only one entry in each bucket.
MISD Multiple instruction, single data is a component of Flynn’s Taxonomy describing a redundant computer for high reliability or a parallel pipeline parallelism.
MMX Earliest x86 vector instruction set released by Intel.
Motherboard The main system board of a computer.
Multi-core A CPU that contains more than one computational core.
Network The connections between compute nodes over which data flows.
Node A basic building block of a compute cluster with its own memory and a network to communicate with other compute nodes and to run a single image of an operating system.
Non-Uniform Memory Access (NUMA) On some computing nodes, blocks of memory are closer to some processors than others. This situation is called Non-Uniform Memory Access (NUMA). Often this is the case when a node has two CPU sockets with each socket having its own memory. The access to the other block of memory typically takes twice the time as its own memory.
N-way set associative cache A cache that allows N locations for a memory address to be mapped into the cache. This reduces the conflicts and evictions associated with direct-mapped cache.
Object-based filesystem A system that is organized based on objects rather than based on files in a folder. An object-based filesystem requires a database or metadata to store all the information describing the object.
Operations (OPs) Operations can be integer, floating-point, or logic.
Out-of-bounds (memory access) Attempting to access memory beyond the array bounds. Fence-post checkers and some compilers can catch these errors.
Output dependency A variable is written to more than once in the loop.
Pageable memory Standard memory allocations that can be paged out to disk. See pinned memory for an alternative type that cannot be paged out.
Parallel algorithm A well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem.
Parallel computing Computing that operates on more than one thing at a time.
Parallel pattern A common, independent, concurrent component of code that occurs in diverse scenarios with some frequency. By themselves, these components generally do not solve complete problems of interest.
Parallel speedup Performance of a parallel implementation relative to a baseline serial run.
Parallelism The operation of parts of a program across a set of resources at the same time.
Pattern rule A specification to the make utility that gives a general rule on how to convert any file with one suffix pattern to a file with another suffix pattern.
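As a sketch, the classic pattern rule that tells make how to build any .o file from the .c file of the same stem (the recipe line must begin with a tab character):

```make
# %.o matches any target ending in .o; %.c is the matching prerequisite.
# $< expands to the prerequisite (the .c file); $@ to the target (the .o file).
%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@
```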
PCI bus Peripheral Component Interconnect bus is the main data pathway between components on the system board, including the CPU, main memory, and the communication network.
Peel loop A loop to execute for misaligned data so that the main loop would then have aligned data. Often the peel loop is conditionally executed at run time if the data is discovered to be misaligned.
Perfect hash A hash where there are no collisions; there is at most one entry in each bucket.
Perfectly nested loops Loops that only have statements in the innermost loop. That means that there are no extraneous statements before or after each loop block.
Performance model A simplified representation of how the operations in a program can be converted into an estimate of the code’s run time.
Pinned memory Memory that cannot be paged out from RAM. It is especially useful for memory transfers because it can be directly sent without making a copy.
POSIX standard The Portable Operating System Interface (POSIX) standard is an IEEE standard for Unix and Unix-like operating systems to facilitate portability. The standard specifies the basic operations that should be provided by the OS.
Pragma An instruction to a C or C++ compiler to help it interpret the source code. The form of the instruction is a preprocessor statement starting with #pragma.
Prescriptive directives and clauses These are directives from the programmer that tell the compiler specifically what to do.
Private variable (OpenMP) In the context of OpenMP, a private variable is local and only visible to its thread.
Process An independent unit of computation that has ownership of a portion of memory and control over resources in user space.
Processing core or (simply) core The most basic unit capable of performing arithmetic and logical operations.
Profilers A programming tool that measures the performance of an application.
Profiling The run-time measurement of some aspects of application performance; most commonly, the time it takes to execute parts of a program.
Race conditions A situation where multiple outcomes are possible and the result is dependent on the timing of the contributors.
Random access memory (RAM) Main system memory where any needed data can be retrieved without having to read sequentially through the data.
Reduction operation Any operation where a multidimensional array from 1 to N dimensions is reduced to at least one dimension smaller, and often to a scalar value.
Register pressure Register pressure refers to the effect of register needs on the performance of GPU kernels.
Regression tests Test suites that are run at periodic intervals such as nightly or weekly.
Relaxed memory model The value of the variables in main memory or caches of all the processors are not updated immediately.
Remainder loop A loop that executes after the main loop to handle a partial set of data that is too small for a full vector length.
Remote procedure call (RPC) A call to the system to execute another command.
Replicated array A dataset that is duplicated across all the processors.
Scalar operation An operation on a single value or one element of an array.
Scatter memory operation Store from a cache line or vector unit to non-contiguous memory locations.
Shared memory A block of memory that is accessible and modifiable by multiple processes or threads of execution. The block of memory is from the programmer’s perspective.
Shared variable (OpenMP) In the context of OpenMP, a shared variable is visible and modifiable by any thread.
SIMD Single instruction, multiple data is a component of Flynn’s Taxonomy describing a parallelism such as that found in vectorization, where a single instruction is applied across multiple data items.
SIMT Single instruction, multiple thread is a variant of SIMD, where multiple threads operate concurrently on multiple data.
SISD Single instruction, single data is a component of Flynn’s Taxonomy that describes a traditional serial architecture.
Socket The location where a processor is inserted on a motherboard. Motherboards normally are either single or dual socket, allowing one or two processors to be installed, respectively.
Source code repository Storage for source code that tracks changes and can be shared between a project’s code developers.
Spatial locality Data with nearby locations in memory that are often referenced close together.
SSE Streaming SIMD Extensions, a vector hardware and instruction set released by Intel that first supported floating-point operations.
SSE2 An improved SSE instruction set that supports double-precision operations.
Stack memory Memory within a subroutine is often created by pushing the objects onto the stack after the stack pointer. These are usually small memory objects that exist for only the duration of the routine and disappear at the end of the routine when the instruction pointer jumps back to the previous location.
Streaming kernels Blocks of computational code that load data in a nearly optimal way to effectively use the cache hierarchy.
Streaming multiprocessor (SM) Usually used to describe the multiprocessors of a GPU that are designed for streaming operations. These are tightly-coupled, symmetric processors (SMP) that have a single instruction stream operating on multiple threads.
Streaming store A store of a value directly to main memory, bypassing the cache hierarchy.
Stride (arrays) Distance between indexed elements in an array. In C, in the x dimension, the data is contiguous or a stride of 1. In the y dimension, the data has a stride of the length of the row.
Super-linear speedup Performance that is better than the ideal strong scaling curve. This can happen because the smaller array sizes fit into a higher level of cache, resulting in better cache performance.
Symmetric processors (SMP) All cores of the multicore processor operate in unison in a single-instruction, multiple-thread (SIMT) fashion.
Task Work that is divided into separate pieces and parceled out to individual processes or threads.
Task parallel A form of parallelism where processors or threads work on separate tasks.
Temporal locality Recently referenced data that is likely to be referenced in the near future.
Test-driven development (TDD) A process of code development where the tests are created first.
Test suite A set of problems that exercise parts of an application to guarantee that parts of the code are still working.
Thread A separate instruction pathway through a process created by having more than one instruction pointer.
Tightly-nested loops Two or more loops that have no extra statements between the for or do statements or the end of the loops.
Time complexity Time complexity takes into account the actual cost of an operation on a typical modern computing system. The largest adjustment for time is to consider the cost of memory loads and caching of data.
Translation lookaside buffer (TLB) The table of entries to translate virtual memory addresses to physical memory. The limited size of the table means that only recently used page locations are held in memory, and a TLB miss occurs if it is not present, incurring a significant performance hit.
Unified memory Memory that has the appearance of being a single address space for both the CPU and the GPU.
Unit testing Testing of each individual component of a program.
Uninitialized memory Memory that is used before its values are set.
User space The scope of control of operations for a program such that it is isolated from the purview of the operating system.
Validated results Results of a calculation that are compared favorably to experimental or real-world data.
Vector (SIMD) instruction set The set of instructions that extend the regular scalar processor instructions to utilize the vector processor.
Vector lane A pathway through a vector operation on vector registers for a single data element much like a lane on a multi-lane freeway.
Vector length The number of operations done in a single cycle by a vector unit.
Vector operation An operation on two or more elements of an array with a single operation or instruction being supplied to the processor.
Vector width The width of the vector unit, usually expressed in bits.
Vectorization The process of grouping operations together so more than one can be done at a time.
Version Control System A database that tracks the changes to your source code, simplifies the merging of multiple developers, and provides a way to roll back changes.
Warm cache When a cache has data to be operated on in the cache from a previous operation as the current operation begins.
Warp An alternate term for a thread workgroup.
Word (size) The size of the basic type being used. For single precision, this is four bytes and for double, it is eight bytes.
Workgroup A group of threads operating together with a single instruction queue.
Figure C.1 Desktop motherboard with Intel CPU and discrete NVIDIA GPU
Figure C.2 Intel CPU installed in its socket, and the underside of the CPU with its data pins. The data transfer to the CPU is limited by the number of pins that can physically fit onto the surface of the CPU.
Accelerated Processing Unit (APU) 312, 368, 589
acc enter data create directive 473
adaptive mesh refinement (AMR) 133
address generation units (AGUs) 118
ADIOS (Adaptable Input/Output System) 568
Advanced Vector Extensions (AVX) 117, 177
changing process affinities during run time 522-524
controlling from command line 516-520
process affinity with MPI 503-511
binding processes to hardware components 511
default process placement with OpenMPI 504
mapping processes to processors or other locations 510
specifying process placement in OpenMPI 504-509
setting affinities in executable 521-522
thread affinity with OpenMP 495-503
understanding architecture 493-494
AGUs (address generation units) 118
AI (artificial intelligence) 3
hash functions, defined 131-132
performance models vs. algorithmic complexity 126-130
task-based support algorithm 250-251
ALUs (arithmetic logic units) 351
AMR (adaptive mesh refinement) 133
AoSoA (Array of Structures of Arrays) 100-101
application/software model 25-29
process-based parallelization 26-27
stream processing through specialized processors 28-29
thread-based parallelization 27-28
APU (Accelerated Processing Unit) 312, 368, 589
ARB (Architecture Review Board) 371
arithmetic logic units (ALUs) 351
Array of Structures. See AoS
artificial intelligence (AI) 3
assembler instructions 195-196
associativity, addressing with parallel global sum 161-166
asynchronous operations, in OpenACC 394
Atomic Weapons Establishment (AWE) 593
AVX (Advanced Vector Extensions) 117, 177
AWE (Atomic Weapons Establishment) 593
calculating machine balance between flops and 71
empirical measurement of 67-69
calculating theoretical peak 319-320
theoretical memory bandwidth 66-67
Basic Linear Algebra System (BLAS) 179
chaos of unmanaged systems 529-530
layout of batch system for busy clusters 530-531
specifying dependencies in batch scripts 543-544
submitting batch scripts 532-536
benchmark application for PCI bus 329-332
calculating machine balance between flops and bandwidth 71
calculating theoretical maximum flops 65
empirical measurement of bandwidth and flops 67-69
measuring GPU stream benchmark 321
memory hierarchy and theoretical memory bandwidth 66-67
tools for gathering system characteristics 62-64
BLAS (Basic Linear Algebra System) 179
branch prediction cost (Bc) 106
Cartesian topology, support for in MPI 292-296
cell-centric compressed sparse storage 112-114
centralized version control 38, 583
central processing unit (CPU) 7, 310
setting compiler flags 200-201
coarse-grained parallelism 228
code portability, improving 50
Collaborative Testing System (CTS) 48
collective_buffering operation 555
CoMD (molecular dynamics) mini-app 593
Compressed Sparse Row (CSR) 105
compressed sparse storage representations 112-116
cell-centric compressed sparse storage 112-114
material-centric compressed sparse storage 114-116
Compute Unified Device Architecture. See CUDA
Concurrent Versions System (CVS) 583
in cloud computing with GPUs 342
CPU (central processing unit) 7, 310
cross-node parallel method 22-23
CSR (Compressed Sparse Row) 105
CTest, automatic testing with 41-45
CTS (Collaborative Testing System) 48
CUDA (Compute Unified Device Architecture) 311
writing and building applications 421-429
interoperability with OpenACC 395-396
CUDPP (CUDA Data Parallel Primitives Library) 161
CVS (Concurrent Versions System) 583
DAOS (distributed application object storage) 576
defining computational kernel or operation 17-18
off-loading calculation to GPUs 21-22
Data Parallel C++ (DPCPP) compiler 449
.deb (Debian Package Manager) 607
dependency analysis, call graphs for 72-78
descriptive directives and clauses 414
differential discretized data 134
DIMMs (dual in-line memory modules) 310
directive-based GPU programming 371-416
parallel compute regions 377-383
summary of performance results for stream triad 393
generating parallel work 398-402
process to apply directives and pragmas for GPU implementation 373-374
distributed application object storage (DAOS) 576
distributed memory architecture 22-23
distributed version control 38, 582-583
DPCPP (Data Parallel C++ ) compiler 449
DRAM (Dynamic Random Access Memory) 22, 310
dual in-line memory modules (DIMMs) 310
dynamic memory requirements 343
Dynamic Random Access Memory (DRAM) 22, 310
ECM (Execution Cache Memory) 116
empirical bandwidth (BE) 61, 67
empirical machine balance (MBE) 71
EpetraBenchmarkTest mini-app 593
Exascale Project proxy apps 592-593
Execution Cache Memory (ECM) 116
FFT (Fast Fourier transform) 179
field-programmable gate arrays (FPGAs) 314, 349, 438
components of high-performance filesystem 548-549
distributed application object storage 576
General Parallel File System 575
parallel-to-serial interface 549-550
fine-grained parallelization 228
flops (floating-point operations) 59, 87
calculating machine balance between bandwidth and 71
calculating peak theoretical flops 316-318
calculating theoretical maximum 65
empirical measurement of 67-69
FMA (fused multiply-add) 65, 118
FPGAs (field-programmable gate arrays) 314, 349, 438
full matrix data representations 109-111
full matrix cell-centric storage 109-110
full matrix material-centric storage 110-111
fused multiply-add (FMA) 65, 118
gather/scatter memory load operation 119
putting order in debug printouts 273-274
sending data out to processes for work 274-276
GCC (GNU Compiler Collection) 40, 42, 374
GCP (Google Cloud Platform) 485
general heterogeneous parallel architecture model 24-25
General Parallel File System (GPFS) 575
general-purpose graphics processing unit (GPGPU) 24, 311
performance tests of variants 297
global sum, using OpenMP threading 227
GNU Compiler Collection (GCC) 40, 42, 374
Google Cloud Platform (GCP) 485
GPFS (General Parallel File System) 575
GPGPU (general-purpose graphics processing unit) 24, 311
GPU (graphics processing unit) languages 417-459
writing and building applications 421-429
higher-level languages 452-457
writing and building applications 439-445
GPU (graphics processing unit) profiling and tools 460-487
data movement directives 473-474
OpenACC compute directives 471-473
selecting good workflow 462-463
shallow water simulation example 463-467
GPU (graphics processing unit) programming model 346-370
asynchronous computing through queues 365-366
addressing memory resources 359-360
developing plan to parallelize applications for 366-368
unstructured mesh application 368
directive-based GPU programming 371-416
process to apply directives and pragmas for GPU implementation 373-374
GPU programming abstractions 354-355
optimizing GPU resource usage 361-362
programming abstractions 348-355
inability to coordinate among tasks 349
data decomposition into independent units of work 350-352
subgroups, warps, and wavefronts 353-354
GPUs (graphics processing units) 309-345
calculating peak theoretical flops 316-318
multiple data operations by each element 316
characteristics of GPU memory spaces 318-326
calculating theoretical peak memory bandwidth 319-320
measuring GPU stream benchmark 321
roofline performance model for GPUs 322
using mixbench performance tool to choose best GPU for workload 324-326
CPU-GPU system as accelerated computational platform 311-313
multi-GPU platforms and MPI 332-334
higher performance alternative to PCI bus 334
optimizing data movement between graphics processing units (GPUs) across network 333
benchmark application for 329-332
theoretical bandwidth of 326-329
potential benefits of GPU-accelerated platforms 334-342
cloud computing cost reduction 342
reducing time-to-solution 335-336
group information, in OpenCL 357
H5Pset_all_coll_metadata_ops command 561
H5Pset_coll_metadata_write command 561
H5Sselect_hyperslab command 560
distributed memory architecture 22-23
general heterogeneous parallel architecture model 24-25
step-efficient parallel scan operation 158
work-efficient parallel scan operation 159-160
HBM2 (High-Bandwidth Memory) 319
HC (Heterogeneous Compute) compiler 349
HDF (Hierarchical Data Format) 559
HDF5 (Hierarchical Data Format v5) 559-566
Heterogeneous Compute (HC) compiler 349
Heterogeneous Interface for Portability (HIP) 349, 435
Hierarchical Data Format (HDF) 559
Hierarchical Data Format v5 (HDF5) 559-566
High-Bandwidth Memory (HBM2) 319
improving parallel scalability with 231
High Performance Computing (HPC) 10, 89
High Performance Conjugate Gradient (HPCG) 52
HIP (Heterogeneous Interface for Portability) 349, 435
HIP_ADD_EXECUTABLE command 437
host_data use_device(var) directive 396
hot-spot analysis, call graphs for 72-78
HPC (High Performance Computing) 10, 89
HPCG (High Performance Conjugate Gradient) 52
ICD (Installable Client Driver) 439
implementation workflow step 54
independent file operation 552
Installable Client Driver (ICD) 439
inter-process communication (IPC) 27
IPC (inter-process communication) 27
Kahan summation implementation 244
LAPACK (linear algebra package) 179
Lawrence Livermore National Laboratory proxies 593
likwid (“Like I Knew What I’m Doing”) 518, 586
pinning OpenMP threads 518-519
linear algebra package (LAPACK) 179
List under Mantevo suite 593-594
reduction example of global sum using OpenMP threading 227
vector addition example 220-222
Los Alamos National Laboratory proxy applications 593
lscpu command 64, 494, 504, 506, 510
machine-specific registers (MSR) 76
-mapby ppr:N:socket:PE=N command 513
material-centric compressed sparse storage 114-116
memops (memory loads and stores) 106, 113
memory error detection and repair 594-599
compiler-based memory tools 597
compact hashing for spatial mesh operations 149-157
defining computational kernel to conduct on each element of mesh 17-18
ghost cell exchanges in 2D mesh 277-285
perfect hashing for spatial mesh operations 135-148
table lookups using spatial perfect hash 145-146
unstructured mesh applications 368
Message Passing Interface. See MPI
MIMD (multiple instruction, multiple data) 29
miniAMR (adaptive mesh refinement) 592
Exascale Project proxy apps 592-593
Lawrence Livermore National Laboratory proxies 593
Los Alamos National Laboratory proxy applications 593
Sandia National Laboratories Mantevo suite 593-594
miniQMC (Quantum Monte Carlo) 592
MISD (multiple instruction, single data) 29
mixbench performance tool 324-326
module show <module_name> command 609
module swap <module_name> <module_name> command 610
module unload <module_name> command 610
molecular dynamics (CoMD) mini-app 593
MPI (Message Passing Interface) 27, 42, 254-304, 491
advanced functionality 286-297
Cartesian topology support in 292-296
performance tests of ghost cell exchange variants 297
basics for minimal program 255-259
minimum working example 257-259
collective communication 266-276
data parallel examples 276-286
ghost cell exchanges in 2D mesh 277-285
ghost cell exchanges in 3D stencil calculation 285-286
stream triad to measure bandwidth on node 276
hybrid MPI plus OpenMP 299-302
multi-GPU platforms and 332-334
higher performance alternative to PCI bus 334
optimizing data movement between GPUs across network 333
plus OpenMP 218-219, 300-302, 511-516
binding processes to hardware components 511
default process placement with OpenMPI 504
mapping processes to processors or other locations 510
specifying process placement in OpenMPI 504-509
send and receive commands for process-to-process communication 259-266
MPI_COMM_WORLD (MCW) 266, 269, 508
MPI_File_read_at_all command 552
MPI_File_set_view function 552
MPI_File_write_all command 552
MPI_File_write_at_all command 552
MPI-IO (MPI file operations) 551-559
mpirun command 257, 504, 508, 510, 515, 517, 573, 603
MPI_Type_contiguous function 287
MPI_Type_create_hindexed function 287
MPI_Type_create_struct function 287
MPI_Type_create_subarray function 287
MSR (machine-specific registers) 76
multiple instruction, multiple data (MIMD) 29
multiple instruction, single data (MISD) 29
NDRange (N-dimensional range) 350-352
face neighbor finding for unstructured meshes 152-153
using spatial perfect hash 135-141
with write optimizations and compact hashing 149-151
network interface card (NIC) 299
NIC (network interface card) 299
non-contiguous bandwidth (Bnc) 61
NUMA (Non-Uniform Memory Access) 24, 210, 218, 492
NVIDIA Nsight suite 461, 476-477
NVIDIA SMI (System Management Interface) 461
NVVP (NVIDIA Visual Profiler) 461
N-way set associative cache 102
Object Storage Servers (OSSs) 575
Object Storage Targets (OSTs) 575
omp_set_num_threads() function 212
omp target enter data directive 403
omp target exit data directive 403
directive-based GPU programming 374-396
interoperability with CUDA libraries or kernels 395-396
parallel compute regions 377-383
OpenCL (Open Computing Language) 311, 438-449
writing and building OpenCL applications 439-445
OpenMP (Open Multi-Processing) 207-253
Kahan summation implementation OpenMP threading 244
stencil example with separate pass for x and y directions 240-243
threaded implementation of prefix scan algorithm 246-247
directive-based GPU programming 396-414
accessing special memory spaces 412-413
controlling kernel parameters 411
generating parallel work 398-402
hybrid threading and vectorization with 237-240
reduction example of global sum using OpenMP threading 227
vector addition example 220-222
MPI plus 218-219, 300-302, 511-516
task-based support algorithm 250-251
default process placement with 504
specifying process placement in 504-509
OpenSFS (Open Scalable File Systems) 575
OSSs (Object Storage Servers) 575
OSTs (Object Storage Targets) 575
hash functions, defined 131-132
performance models vs. algorithmic complexity 126-130
step-efficient parallel scan operation 158
work-efficient parallel scan operation 159-160
application/software model for 25-29
process-based parallelization 26-27
stream processing through specialized processors 28-29
thread-based parallelization 27-28
categorizing parallel approaches 29-30
distributed memory architecture 22-23
general heterogeneous parallel architecture model 24-25
parallel speedup vs. comparative speedup 31-32
faster run time with more compute cores 9
larger problem sizes with more compute nodes 9
reasons for learning about 6-11
defining computational kernel or operation 17-18
off-loading calculation to GPUs 21-22
Parallel Data Systems Workshop (PDSW) 577
parallel development workflow 35-57
design of core data structures and code modularity 53
finding and fixing memory issues 48-50
parallel directive 233, 381, 496
step-efficient parallel scan operation 158
work-efficient parallel scan operation 159-160
parallel speedup (serial-to-parallel speedup) 31-32
parallel-to-serial interface 549-550
Parallel Virtual File System (PVFS) 576
partial differential equations (PDEs) 18
PBS (Portable Batch System) 529
PCI (Peripheral Component Interconnect) 24, 312
benchmark application for 329-332
higher performance alternative to 334
theoretical bandwidth of 326-329
PCIe (Peripheral Component Interconnect Express) 310, 326
PCI SIG (PCI Special Interest Group) 327
PDEs (partial differential equations) 18
PDF (portable document format) 606
PDSW (Parallel Data Systems Workshop) 577
calculating machine balance between flops and bandwidth 71
calculating theoretical maximum flops 65
empirical measurement of bandwidth and flops 67-69
memory hierarchy and theoretical memory bandwidth 66-67
tools for gathering system characteristics 62-64
knowing potential limits 59-61
empirical measurement of processor clock frequency and energy consumption 82-83
tracking memory during run time 83-84
algorithmic complexity vs. 126-130
compressed sparse storage representations 112-116
full matrix data representations 109-111
Peripheral Component Interconnect (PCI) 24, 312
PEs (processing elements), OpenCL 316
pgaccelinfo command 353, 375, 388
design of core data structures and code modularity 53
PnetCDF (Parallel Network Common Data Form) 568
Portable Batch System (PBS) 529
portable document format (PDF) 606
POSIX (Portable Operating System Interface) 585, 607
directive-based GPU programming 373-374
prefix sum operations (scans) 157-161
step-efficient parallel scan operation 158
threaded implementation of 246-247
work-efficient parallel scan operation 159-160
preparation workflow step 36-50
finding and fixing memory issues 48-50
automatic testing with CMake and CTest 41-45
changes in results due to parallelism 40-41
requirements of ideal testing system 48
process-based parallelization 26-27
medium-level profilers 588-589
empirical measurement of processor clock frequency and energy consumption 82-83
tracking memory during run time 83-84
PVFS (Parallel Virtual File System) 576
Quantum Monte Carlo (miniQMC) 592
queues (streams), asynchronous computing through 365-366
Radeon Open Compute platform (ROCm) 435
Red Hat Package Manager (.rpm) 607
reduction operation 122, 162, 227
getting single value from across all processes 269-273
synchronization across work groups 364-365
hierarchical hash technique for 156-157
using spatial perfect hash 142
with write optimizations and compact hashing 153-156
remote procedure call (RPC) 27
ROCm (Radeon Open Compute platform) 435
RPC (remote procedure call) 27
.rpm (Red Hat Package Manager) 607
SCALAPACK (scalable linear algebra package) 179
SDE (Software Development Emulator) package 81
send and receive commands 259-266
serial-to-parallel speedup (parallel speedup) 31-32
shallow water simulation example
data movement directives 473-474
OpenACC compute directives 471-473
SIMD (single instruction, multiple data) architecture
OpenMP SIMD directives 203-205
Simple Linux Utility for Resource Management (Slurm) 529
SIMT (single instruction, multi-thread) 30, 314, 354
Slurm (Simple Linux Utility for Resource Management) 529
SMs (streaming multiprocessors) 21, 24, 316, 479
SNAP (SN application proxy) 593
SoA (Structures of Arrays) 197
Software Development Emulator (SDE) package 81
spack load <package_name> command 609
SSE2 (Streaming SIMD Extensions) 177
ghost cell exchanges in 3D stencil calculation 285-286
with separate pass for x and y directions 240-243
step-efficient parallel scan operation 158
STL (Standard Template Library) 420
streaming multiprocessors (SMs) 21, 24, 316, 479
stream processing through specialized processors 28-29
measuring bandwidth on node 276
table lookups, using spatial perfect hash 143, 145-146
task-based support algorithm 250-251
TDD (test-driven development) 46
TDP (thermal design power) 337
automatic testing with CMake and CTest 41-45
changes in results due to parallelism 40-41
requirements of ideal testing system 48
theoretical machine balance (MBT) 71
theoretical memory bandwidth (BT) 66
global sum using OpenMP threading 227
hybrid threading with OpenMP 237-240
Kahan summation implementation with OpenMP threading 244
thread-based parallelization 27-28
threaded implementation of prefix scan algorithm 246-247
calculating peak theoretical flops 316-318
multiple data operations by each element 316
unstructured mesh boundary communications 302
assembler instructions 195-196
multiple operations with one instruction 28
OpenMP SIMD directives 203-205
vector_length(x) directive 388
version control 37-39, 582-583
centralized version control 583
distributed version control 582-583
VMs (virtual machines) 483-485
Windows Subsystem for Linux (WSL) 608
work-efficient parallel scan operation 159-160